Intel® 64 and IA-32 Architectures Software Developer's Manual
Documentation Changes
Document Number: 252046
October 2019
Notice: The Intel® 64 and IA-32 architectures may contain design defects or errors known as errata
that may cause the product to deviate from published specifications. Current characterized errata are
documented in the specification updates.
Revision History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Summary Tables of Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Documentation Changes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Revision History
Revision -013 (July 2005):
• Updated title.
• There are no Documentation Changes for this revision of the document.
Affected Documents

Document Title (Document Number):
• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture (253665)
• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2A: Instruction Set Reference, A-L (253666)
• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2B: Instruction Set Reference, M-U (253667)
• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2C: Instruction Set Reference, V-Z (326018)
• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2D: Instruction Set Reference (334569)
• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1 (253668)
• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2 (253669)
• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3C: System Programming Guide, Part 3 (326019)
• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3D: System Programming Guide, Part 4 (332831)
• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 4: Model Specific Registers (335592)
Nomenclature
Documentation Changes include typos, errors, or omissions from the current published specifications. These
will be incorporated in any new release of the specification.
Documentation Changes (Sheet 1 of 2)
No. DOCUMENTATION CHANGES

1. Updates to Chapter 1, Volume 1
------------------------------------------------------------------------------------------
Changes to this chapter: Updates to Section 1.1 “Intel® 64 and IA-32 Processors Covered in this Manual”. General
cleanup of Section 1.2 “Overview of Volume 1: Basic Architecture”.
The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture (order number
253665) is part of a set that describes the architecture and programming environment of Intel® 64 and IA-32
architecture processors. Other volumes in this set are:
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B, 2C & 2D: Instruction Set
Reference (order numbers 253666, 253667, 326018 and 334569).
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A, 3B, 3C & 3D: System
Programming Guide (order numbers 253668, 253669, 326019 and 332831).
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4: Model-Specific Registers
(order number 335592).
The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, describes the basic architecture
and programming environment of Intel 64 and IA-32 processors. The Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volumes 2A, 2B, 2C & 2D, describe the instruction set of the processor and the opcode struc-
ture. These volumes apply to application programmers and to programmers who write operating systems or exec-
utives. The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A, 3B, 3C & 3D, describe
the operating-system support environment of Intel 64 and IA-32 processors. These volumes target operating-
system and BIOS designers. In addition, the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3B, addresses the programming environment for classes of software that host operating systems. The
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4, describes the model-specific registers
of Intel 64 and IA-32 processors.
Vol. 1 1-1
ABOUT THIS MANUAL
The Intel® Atom™ processor C series, the Intel® Atom™ processor X series, the Intel® Pentium® processor J
series, the Intel® Celeron® processor J series, and the Intel® Celeron® processor N series are based on the Gold-
mont microarchitecture.
The Intel® Xeon Phi™ Processor 3200, 5200, 7200 Series is based on the Knights Landing microarchitecture and
supports Intel 64 architecture.
The Intel® Pentium® Silver processor series, the Intel® Celeron® processor J series, and the Intel® Celeron®
processor N series are based on the Goldmont Plus microarchitecture.
The 8th generation Intel® Core™ processors, 9th generation Intel® Core™ processors, and Intel® Xeon® E proces-
sors are based on the Coffee Lake microarchitecture and support Intel 64 architecture.
The Intel® Xeon Phi™ Processor 7215, 7285, 7295 Series is based on the Knights Mill microarchitecture and
supports Intel 64 architecture.
The 2nd generation Intel® Xeon® Processor Scalable Family is based on the Cascade Lake product and supports
Intel 64 architecture.
The 10th generation Intel® Core™ processors are based on the Ice Lake microarchitecture and support Intel 64
architecture.
IA-32 architecture is the instruction set architecture and programming environment for Intel's 32-bit microproces-
sors. Intel® 64 architecture is the instruction set architecture and programming environment which is the superset
of Intel’s 32-bit and 64-bit architectures. It is compatible with the IA-32 architecture.
describes SIMD floating-point exceptions that can be generated with SSE and SSE2 instructions. It also provides
general guidelines for incorporating support for SSE and SSE2 extensions into operating system and applications
code.
Chapter 12 — Programming with Intel® Streaming SIMD Extensions 3 (Intel® SSE3), Supplemental
Streaming SIMD Extensions 3 (SSSE3), Intel® Streaming SIMD Extensions 4 (Intel® SSE4) and Intel®
AES New Instructions (Intel® AES-NI). Provides an overview of the SSE3 instruction set, Supplemental SSE3,
SSE4, and AES-NI instructions, and gives guidelines for writing code that accesses these extensions.
Chapter 13 — Managing State Using the XSAVE Feature Set. Describes the XSAVE feature set instructions
and explains how software can enable the XSAVE feature set and XSAVE-enabled features.
Chapter 14 — Programming with AVX, FMA and AVX2. Provides an overview of the Intel® AVX instruction set,
FMA and Intel AVX2 extensions and gives guidelines for writing code that accesses these extensions.
Chapter 15 — Programming with Intel® AVX-512. Provides an overview of the Intel® AVX-512 instruction set
extensions and gives guidelines for writing code that accesses these extensions.
Chapter 16 — Programming with Intel Transactional Synchronization Extensions. Describes the instruction
extensions that support lock elision techniques to improve the performance of multi-threaded software with
contended locks.
Chapter 17 — Intel® Memory Protection Extensions. Provides an overview of the Intel® Memory Protection
Extensions and gives guidelines for writing code that accesses these extensions.
Chapter 18 — Control-flow Enforcement Technology. Provides an overview of the Control-flow Enforcement
Technology (CET) and gives guidelines for writing code that accesses these extensions.
Chapter 19 — Input/Output. Describes the processor’s I/O mechanism, including I/O port addressing, I/O
instructions, and I/O protection mechanisms.
Chapter 20 — Processor Identification and Feature Determination. Describes how to determine the CPU
type and features available in the processor.
Appendix A — EFLAGS Cross-Reference. Summarizes how the IA-32 instructions affect the flags in the EFLAGS
register.
Appendix B — EFLAGS Condition Codes. Summarizes how conditional jump, conditional move, and 'byte set on
condition' instructions use the condition code flags (OF, CF, ZF, SF, and PF) in the EFLAGS register.
Appendix C — Floating-Point Exceptions Summary. Summarizes exceptions raised by the x87 FPU floating-
point and SSE/SSE2/SSE3 floating-point instructions.
Appendix D — Guidelines for Writing x87 FPU Exception Handlers. Describes how to design and write MS-
DOS* compatible exception handling facilities for FPU exceptions (includes software and hardware requirements
and assembly-language code examples). This appendix also describes general techniques for writing robust FPU
exception handlers.
Appendix E — Guidelines for Writing SIMD Floating-Point Exception Handlers. Gives guidelines for writing
exception handlers for exceptions generated by SSE/SSE2/SSE3 floating-point instructions.
[Figure 1-1. Bit and Byte Order: diagram of a data structure in memory, with byte offsets 0 through 28 increasing toward the highest address, bit offsets 0 through 31 across each doubleword, and Byte 0 of each doubleword at the lowest address.]
NOTE
Avoid any software dependence upon the state of reserved bits in Intel 64 and IA-32 registers.
Depending upon the values of reserved register bits makes software dependent upon the
unspecified manner in which the processor handles these bits. Programs that depend upon
reserved values risk incompatibility with future processors.
A segment address is written in the form:
Segment-register:Byte-address
For example, the following segment address identifies the byte at address FF79H in the segment pointed by the DS
register:
DS:FF79H
The following segment address identifies an instruction address in the code segment. The CS register points to the
code segment and the EIP register contains the address of the instruction.
CS:EIP
CPUID.01H:EDX.SSE[bit 25] = 1 (example CPUID bit)
CR4.OSFXSR[bit 9] = 1 (example CR name and bit)
IA32_MISC_ENABLE.ENABLEFOPCODE[bit 2] = 1 (example MSR name and bit)
Figure 1-2. Syntax for CPUID, CR, and MSR Data Presentation
1.3.6 Exceptions
An exception is an event that typically occurs when an instruction causes an error. For example, an attempt to
divide by zero generates an exception. However, some exceptions, such as breakpoints, occur under other condi-
tions. Some types of exceptions may provide error codes. An error code reports additional information about the
error. An example of the notation used to show an exception and error code is shown below:
#PF(fault code)
This example refers to a page-fault exception under conditions where an error code naming a type of fault is
reported. Under some conditions, exceptions that produce error codes may not be able to report an accurate code.
In this case, the error code is zero, as shown below for a general-protection exception:
#GP(0)
2. Updates to Chapter 5, Volume 1
Change bars and green text show changes to Chapter 5 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1: Basic Architecture.
------------------------------------------------------------------------------------------
Changes to this chapter: General cleanup and updating chapter contents using cross-references to section infor-
mation. Updated Table 5-2 “Instruction Set Extensions Introduction in Intel 64 and IA-32 Processors”. Added
sections on Shadow Stack Management instructions and Control Transfer Terminating instructions.
This chapter provides an abridged overview of Intel 64 and IA-32 instructions. Instructions are divided into the
following groups:
• Section 5.1, “General-Purpose Instructions”.
• Section 5.2, “x87 FPU Instructions”.
• Section 5.3, “x87 FPU AND SIMD State Management Instructions”.
• Section 5.4, “MMX™ Instructions”.
• Section 5.5, “SSE Instructions”.
• Section 5.6, “SSE2 Instructions”.
• Section 5.7, “SSE3 Instructions”.
• Section 5.8, “Supplemental Streaming SIMD Extensions 3 (SSSE3) Instructions”.
• Section 5.9, “SSE4 Instructions”.
• Section 5.10, “SSE4.1 Instructions”.
• Section 5.11, “SSE4.2 Instruction Set”.
• Section 5.12, “Intel® AES-NI and PCLMULQDQ”.
• Section 5.13, “Intel® Advanced Vector Extensions (Intel® AVX)”.
• Section 5.14, “16-bit Floating-Point Conversion”.
• Section 5.15, “Fused-Multiply-ADD (FMA)”.
• Section 5.16, “Intel® Advanced Vector Extensions 2 (Intel® AVX2)”.
• Section 5.17, “Intel® Transactional Synchronization Extensions (Intel® TSX)”.
• Section 5.18, “Intel® SHA Extensions”.
• Section 5.19, “Intel® Advanced Vector Extensions 512 (Intel® AVX-512)”.
• Section 5.20, “System Instructions”.
• Section 5.21, “64-Bit Mode Instructions”.
• Section 5.22, “Virtual-Machine Extensions”.
• Section 5.23, “Safer Mode Extensions”.
• Section 5.24, “Intel® Memory Protection Extensions”.
• Section 5.25, “Intel® Software Guard Extensions”.
• Section 5.26, “Shadow Stack Management Instructions”.
• Section 5.27, “Control Transfer Terminating Instructions”.
Table 5-1 lists the groups and IA-32 processors that support each group. More recent instruction set extensions are
listed in Table 5-2. Within these groups, most instructions are collected into functional subgroups.
Vol. 1 5-1
INSTRUCTION SET SUMMARY
Table 5-2. Instruction Set Extensions Introduction in Intel 64 and IA-32 Processors
(Instruction Set Architecture: Processor Generation Introduction)
• SSE4.1 Extensions: Intel® Xeon® processor 3100, 3300, 5200, 5400, 7400, 7500 series; Intel® Core™ 2 Extreme processors QX9000 series; Intel® Core™ 2 Quad processor Q9000 series; Intel® Core™ 2 Duo processors 8000 series and T9000 series; Intel Atom® processor based on Silvermont microarchitecture.
• SSE4.2 Extensions, CRC32, POPCNT: Intel® Core™ i7 965 processor; Intel® Xeon® processors X3400, X3500, X5500, X6500, X7500 series; Intel Atom processor based on Silvermont microarchitecture.
• Intel® AES-NI, PCLMULQDQ: Intel® Xeon® processor E7 series; Intel® Xeon® processors X3600 and X5600; Intel® Core™ i7 980X processor; Intel Atom processor based on Silvermont microarchitecture. Use CPUID to verify presence of Intel AES-NI and PCLMULQDQ across Intel® Core™ processor families.
• Intel® AVX: Intel® Xeon® processor E3 and E5 families; 2nd Generation Intel® Core™ i7, i5, i3 processor 2xxx families.
• F16C: 3rd Generation Intel® Core™ processors; Intel® Xeon® processor E3-1200 v2 product family; Intel® Xeon® processor E5 v2 and E7 v2 families.
• RDRAND: 3rd Generation Intel Core processors; Intel Xeon processor E3-1200 v2 product family; Intel Xeon processor E5 v2 and E7 v2 families; Intel Atom processor based on Silvermont microarchitecture.
• FS/GS base access: 3rd Generation Intel Core processors; Intel Xeon processor E3-1200 v2 product family; Intel Xeon processor E5 v2 and E7 v2 families; Intel Atom® processor based on Goldmont microarchitecture.
• FMA, AVX2, BMI1, BMI2, INVPCID, LZCNT, Intel® TSX: Intel® Xeon® processor E3/E5/E7 v3 product families; 4th Generation Intel® Core™ processor family.
• MOVBE: Intel Xeon processor E3/E5/E7 v3 product families; 4th Generation Intel Core processor family; Intel Atom processors.
• PREFETCHW: Intel® Core™ M processor family; 5th Generation Intel® Core™ processor family; Intel Atom processor based on Silvermont microarchitecture.
Table 5-2. Instruction Set Extensions Introduction in Intel 64 and IA-32 Processors (Contd.)
• Intel® SHA Extensions: Intel Atom processor based on Goldmont microarchitecture.
• ADX: Intel Core M processor family; 5th Generation Intel Core processor family.
• RDSEED, CLAC, STAC: Intel Core M processor family; 5th Generation Intel Core processor family; Intel Atom processor based on Goldmont microarchitecture.
• AVX512ER, AVX512PF, PREFETCHWT1: Intel® Xeon Phi™ Processor 3200, 5200, 7200 Series.
• AVX512F, AVX512CD: Intel Xeon Phi Processor 3200, 5200, 7200 Series; Intel® Xeon® Processor Scalable Family; Intel® Core™ i3-8121U processor.
• CLFLUSHOPT, XSAVEC, XSAVES, Intel® MPX: Intel Xeon Processor Scalable Family; 6th Generation Intel® Core™ processor family; Intel Atom processor based on Goldmont microarchitecture.
• SGX1: 6th Generation Intel Core processor family; Intel Atom® processor based on Goldmont Plus microarchitecture.
• AVX512DQ, AVX512BW, AVX512VL: Intel Xeon Processor Scalable Family; Intel Core i3-8121U processor.
• CLWB: Intel Xeon Processor Scalable Family; 10th Generation Intel® Core™ processor family; Intel Atom® processor based on Tremont microarchitecture.
• PKU: Intel Xeon Processor Scalable Family.
• AVX512_IFMA, AVX512_VBMI: Intel Core i3-8121U processor.
• SHA-NI: Intel Core i3-8121U processor; Intel Atom processor based on Goldmont microarchitecture.
• UMIP: Intel Core i3-8121U processor; Intel Atom processor based on Goldmont Plus microarchitecture.
• PTWRITE: Intel Atom processor based on Goldmont Plus microarchitecture.
• RDPID: 10th Generation Intel® Core™ processor family; Intel Atom processor based on Goldmont Plus microarchitecture.
• AVX512_4FMAPS, AVX512_4VNNIW: Intel® Xeon Phi™ Processor 7215, 7285, 7295 Series.
• AVX512_VNNI: 2nd Generation Intel® Xeon® Processor Scalable Family; 10th Generation Intel Core processor family.
• AVX512_VPOPCNTDQ: Intel Xeon Phi Processor 7215, 7285, 7295 Series; 10th Generation Intel Core processor family.
• Fast Short REP MOV: 10th Generation Intel Core processor family.
• GFNI (SSE): 10th Generation Intel Core processor family; Intel Atom processor based on Tremont microarchitecture.
• VAES, GFNI (AVX/AVX512), AVX512_VBMI2, VPCLMULQDQ, AVX512_BITALG: 10th Generation Intel Core processor family.
• ENCLV: Intel Atom processor based on Tremont microarchitecture.
• Split Lock Detection: 10th Generation Intel Core processor family; Intel Atom processor based on Tremont microarchitecture.
• CLDEMOTE: Intel Atom processor based on Tremont microarchitecture.
• Direct stores (MOVDIRI, MOVDIR64B): Intel Atom processor based on Tremont microarchitecture.
• User wait (TPAUSE, UMONITOR, UMWAIT): Intel Atom processor based on Tremont microarchitecture.
• Control-flow Enforcement Technology (CET): Future Intel Core processors.
The following sections list the instructions in each major group and subgroup. Each instruction is given with its
mnemonic and descriptive name. When two or more mnemonics are given (for example, CMOVA/CMOVNBE), they
represent different mnemonics for the same instruction opcode. Assemblers support redundant mnemonics for
some instructions to make code listings easier to read. For instance, CMOVA (conditional move if above) and
CMOVNBE (conditional move if not below or equal) represent the same condition. For detailed information about
specific instructions, see the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volumes 2A, 2B, 2C
& 2D.
MOVDDUP Loads/moves 64 bits (bits[63:0] if the source is a register) and returns the same 64 bits in
both the lower and upper halves of the 128-bit result register; duplicates the 64 bits from
the source.
BLENDPS Conditionally copies specified single-precision floating-point data elements in the source
operand to the corresponding data elements in the destination, using an immediate byte
control.
BLENDVPD Conditionally copies specified double-precision floating-point data elements in the source
operand to the corresponding data elements in the destination, using an implied mask.
BLENDVPS Conditionally copies specified single-precision floating-point data elements in the source
operand to the corresponding data elements in the destination, using an implied mask.
PBLENDVB Conditionally copies specified byte elements in the source operand to the corresponding
elements in the destination, using an implied mask.
PBLENDW Conditionally copies specified word elements in the source operand to the corresponding
elements in the destination, using an immediate byte control.
PEXTRQ Extract a qword from an XMM register and insert the value into a general-purpose register
or memory.
• Table 14-3 lists 256-bit and 128-bit data movement and processing instructions promoted from legacy 128-bit
SIMD instruction sets.
• Table 14-4 lists functional enhancements of 256-bit AVX instructions not available from legacy 128-bit SIMD
instruction sets.
• Table 14-5 lists 128-bit integer and floating-point instructions promoted from legacy 128-bit SIMD instruction
sets.
• Table 14-6 lists functional enhancements of 128-bit AVX instructions not available from legacy 128-bit SIMD
instruction sets.
• Table 14-7 lists 128-bit data movement and processing instructions promoted from legacy instruction sets.
VSCALESD/SS Multiply the low DP/SP FP element of a vector by powers of two with exponent specified in
the corresponding element of a second vector.
VSCATTERDD/DQ Scatter SP/DP FP elements in a vector to memory using dword indices.
VSCATTERQD/QQ Scatter SP/DP FP elements in a vector to memory using qword indices.
VSHUFF32X4/64X2 Shuffle 128-bit lanes of a vector with 32/64 bit granular conditional update.
VSHUFI32X4/64X2 Shuffle 128-bit lanes of a vector with 32/64 bit granular conditional update.
512-bit instruction mnemonics in AVX-512DQ that are not AVX/AVX2 promotions include:
VCVT(T)PD2QQ Convert packed DP FP elements of a vector to packed signed 64-bit integers.
VCVT(T)PD2UQQ Convert packed DP FP elements of a vector to packed unsigned 64-bit integers.
VCVT(T)PS2QQ Convert packed SP FP elements of a vector to packed signed 64-bit integers.
VCVT(T)PS2UQQ Convert packed SP FP elements of a vector to packed unsigned 64-bit integers.
VCVTUQQ2PD/PS Convert packed unsigned 64-bit integers to packed DP/SP FP elements.
VEXTRACTF64X2 Extract a vector from a full-length vector with 64-bit granular update.
VEXTRACTI64X2 Extract a vector from a full-length vector with 64-bit granular update.
VFPCLASSPD/PS Test packed DP/SP FP elements in a vector by numeric/special-value category.
VFPCLASSSD/SS Test the low DP/SP FP element by numeric/special-value category.
VINSERTF64X2 Insert a 128-bit vector into a full-length vector with 64-bit granular update.
VINSERTI64X2 Insert a 128-bit vector into a full-length vector with 64-bit granular update.
VPMOVM2D/Q Convert opmask register to vector register in 32/64-bit granularity.
VPMOVD2M/Q2M Convert a vector register in 32/64-bit granularity to an opmask register.
VPMULLQ Multiply packed signed 64-bit integer elements of two vectors and store low 64-bit signed
result.
VRANGEPD/PS Perform RANGE operation on each pair of DP/SP FP elements of two vectors using specified
range primitive in imm8.
VRANGESD/SS Perform RANGE operation on the pair of low DP/SP FP element of two vectors using speci-
fied range primitive in imm8.
VREDUCEPD/PS Perform Reduction operation on packed DP/SP FP elements of a vector using specified
reduction primitive in imm8.
VREDUCESD/SS Perform Reduction operation on the low DP/SP FP element of a vector using specified
reduction primitive in imm8.
512-bit instruction mnemonics in AVX-512BW that are not AVX/AVX2 promotions include:
VDBPSADBW Double block packed Sum-Absolute-Differences on unsigned bytes.
VMOVDQU8/16 VMOVDQU with 8/16-bit granular conditional update.
VPBLENDMB Replaces the VPBLENDVB instruction (using opmask as select control).
VPBLENDMW Blend word elements using opmask as select control.
VPBROADCASTB/W Broadcast from general-purpose register to vector register.
VPCMPB/UB Compare packed signed/unsigned bytes using specified primitive.
VPCMPW/UW Compare packed signed/unsigned words using specified primitive.
VPERMW Permute packed word elements.
VPERMI2B/W Full permute from two tables of byte/word elements overwriting the index vector.
VPMOVM2B/W Convert opmask register to vector register in 8/16-bit granularity.
VPMOVB2M/W2M Convert a vector register in 8/16-bit granularity to an opmask register.
VPMOV(S|US)WB Down convert word elements in a vector to byte elements using truncation (saturation |
unsigned saturation).
VPSLLVW Shift word elements in a vector left by shift counts in a vector.
VPSRAVW Shift words right by shift counts in a vector and shifting in sign bits.
512-bit instruction mnemonics in AVX-512CD that are not AVX/AVX2 promotions include:
VPBROADCASTM Broadcast from opmask register to vector register.
VPCONFLICTD/Q Detect conflicts within a vector of packed 32/64-bit integers.
VPLZCNTD/Q Count the number of leading zero bits of packed dword/qword elements.
None of the instructions above can be executed in compatibility mode; they generate invalid-opcode exceptions if
executed in compatibility mode.
The behavior of the guest-available instructions is summarized below:
VMCALL Allows a guest in VMX non-root operation to call the VMM for service. A VM exit occurs,
transferring control to the VMM.
VMFUNC This instruction allows software in VMX non-root operation to invoke a VM function, which
is processor functionality enabled and configured by software in VMX root operation. No
VM exit occurs.
Table 5-3. Supervisor and User Mode Enclave Instruction Leaf Functions in Long-Form of SGX1

Supervisor instructions:
• ENCLS[EADD]: Add a page
• ENCLS[EBLOCK]: Block an EPC page
• ENCLS[ECREATE]: Create an enclave
• ENCLS[EDBGRD]: Read data by debugger
• ENCLS[EDBGWR]: Write data by debugger
• ENCLS[EEXTEND]: Extend EPC page measurement
• ENCLS[EINIT]: Initialize an enclave
• ENCLS[ELDB]: Load an EPC page as blocked
• ENCLS[ELDU]: Load an EPC page as unblocked
• ENCLS[EPA]: Add version array
• ENCLS[EREMOVE]: Remove a page from EPC
• ENCLS[ETRACK]: Activate EBLOCK checks
• ENCLS[EWB]: Write back/invalidate an EPC page

User instructions:
• ENCLU[EENTER]: Enter an enclave
• ENCLU[EEXIT]: Exit an enclave
• ENCLU[EGETKEY]: Create a cryptographic key
• ENCLU[EREPORT]: Create a cryptographic report
• ENCLU[ERESUME]: Re-enter an enclave
3. Updates to Chapter 13, Volume 1
Updates to Chapter 13 added to the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1:
Basic Architecture.
------------------------------------------------------------------------------------------
The XSAVE feature set extends the functionality of the FXSAVE and FXRSTOR instructions (see Section 10.5,
“FXSAVE and FXRSTOR Instructions”) by supporting the saving and restoring of processor state in addition to the
x87 execution environment (x87 state) and the registers used by the streaming SIMD extensions (SSE state).
The XSAVE feature set comprises eight instructions. XGETBV and XSETBV allow software to read and write the
extended control register XCR0, which controls the operation of the XSAVE feature set. XSAVE, XSAVEOPT,
XSAVEC, and XSAVES are four instructions that save processor state to memory; XRSTOR and XRSTORS are corre-
sponding instructions that load processor state from memory. XGETBV, XSAVE, XSAVEOPT, XSAVEC, and XRSTOR
can be executed at any privilege level; XSETBV, XSAVES, and XRSTORS can be executed only if CPL = 0. In addition
to XCR0, the XSAVES and XRSTORS instructions are also controlled by the IA32_XSS MSR (index DA0H).
The XSAVE feature set organizes the state that it manages into state components. Operation of the instructions is
based on state-component bitmaps that have the same format as XCR0 and as the IA32_XSS MSR: each bit
corresponds to a state component. Section 13.1 discusses these state components and bitmaps in more detail.
Section 13.2 describes how the processor enumerates support for the XSAVE feature set and for XSAVE-enabled
features (those features that require use of the XSAVE feature set for their enabling). Section 13.3 explains how
software can enable the XSAVE feature set and XSAVE-enabled features.
The XSAVE feature set allows saving and loading processor state from a region of memory called an XSAVE area.
Section 13.4 presents details of the XSAVE area and its organization. Each XSAVE-managed state component is
associated with a section of the XSAVE area. Section 13.5 describes in detail each of the XSAVE-managed state
components.
Section 13.7 through Section 13.12 describe the operation of XSAVE, XRSTOR, XSAVEOPT, XSAVEC, XSAVES, and
XRSTORS, respectively.
Vol. 1 13-1
MANAGING STATE USING THE XSAVE FEATURE SET
— State component 6 is used for the upper 256 bits of the registers ZMM0–ZMM15. These 16 256-bit values
are denoted ZMM0_H–ZMM15_H (ZMM_Hi256 state).
— State component 7 is used for the 16 512-bit registers ZMM16–ZMM31 (Hi16_ZMM state).
• Bit 8 corresponds to the state component used for the Intel Processor Trace MSRs (PT state).
• Bit 9 corresponds to the state component used for the protection-key feature’s register PKRU (PKRU state).
See Section 13.5.7.
• Bits 12:11 correspond to the two state components used for the additional register state used by Control-Flow
Enforcement Technology (CET state):
— State component 11 is used for the 2 MSRs controlling user-mode functionality for CET (CET_U state).
— State component 12 is used for the 3 MSRs containing shadow-stack pointers for privilege levels 0–2
(CET_S state).
• Bit 13 corresponds to the state component used for an MSR used to control hardware duty cycling (HDC
state). See Section 13.5.9.
Bit 10 and bits in the range 62:14 are not currently defined in state-component bitmaps and are reserved for future
expansion. As individual state components are defined using those bits, additional sub-sections will be updated
within Section 13.5 over time. Bit 63 is used for special functionality in some bitmaps and does not correspond to
any state component.
The state component corresponding to bit i of state-component bitmaps is called state component i. Thus, x87
state is state component 0; SSE state is state component 1; AVX state is state component 2; MPX state comprises
state components 3–4; AVX-512 state comprises state components 5–7; PT state is state component 8; PKRU
state is state component 9; CET state comprises state components 11–12; and HDC state is state component 13.
The XSAVE feature set uses state-component bitmaps in multiple ways. Most of the instructions use an implicit
operand (in EDX:EAX), called the instruction mask, which is the state-component bitmap that specifies the state
components on which the instruction operates.
Some state components are user state components, and they can be managed by the entire XSAVE feature set.
Other state components are supervisor state components, and they can be managed only by XSAVES and
XRSTORS. The state components corresponding to bit 9 and to bits in the range 7:0 are user state components;
those corresponding to bit 8 and to bits in the range 13:11 are supervisor state components.
Extended control register XCR0 contains a state-component bitmap that specifies the user state components that
software has enabled the XSAVE feature set to manage. If the bit corresponding to a state component is clear in
XCR0, instructions in the XSAVE feature set will not operate on that state component, regardless of the value of the
instruction mask.
The IA32_XSS MSR (index DA0H) contains a state-component bitmap that specifies the supervisor state compo-
nents that software has enabled XSAVES and XRSTORS to manage (XSAVE, XSAVEC, XSAVEOPT, and XRSTOR
cannot manage supervisor state components). If the bit corresponding to a state component is clear in the
IA32_XSS MSR, XSAVES and XRSTORS will not operate on that state component, regardless of the value of the
instruction mask.
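The gating described in the last two paragraphs can be sketched in C. The helper names are illustrative, and the result of ANDing the instruction mask with the enabling registers corresponds to what later sections of this chapter call RFBM (the requested-feature bitmap); values here are supplied by the caller, not read from hardware.

```c
#include <stdint.h>

/* Sketch: the state components an XSAVE-family instruction operates on.
 * The instruction mask (EDX:EAX) is ANDed with XCR0 for the user-state
 * instructions, and with XCR0 | IA32_XSS for XSAVES/XRSTORS. */
static inline uint64_t rfbm_user(uint64_t instruction_mask, uint64_t xcr0)
{
    /* XSAVE, XSAVEC, XSAVEOPT, XRSTOR */
    return instruction_mask & xcr0;
}

static inline uint64_t rfbm_supervisor(uint64_t instruction_mask,
                                       uint64_t xcr0, uint64_t ia32_xss)
{
    /* XSAVES, XRSTORS */
    return instruction_mask & (xcr0 | ia32_xss);
}
```

For example, with XCR0 enabling only x87/SSE/AVX state (7H), an instruction mask of all ones still operates only on state components 0–2.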
Some XSAVE-supported features can be used only if XCR0 has been configured so that the features’ state compo-
nents can be managed by the XSAVE feature set. (This applies only to features with user state components.) Such
state components and features are XSAVE-enabled. In general, the processor will not modify (or allow modifica-
tion of) the registers of a state component of an XSAVE-enabled feature if the bit corresponding to that state
component is clear in XCR0. (If software clears such a bit in XCR0, the processor preserves the corresponding state
component.) If an XSAVE-enabled feature has not been fully enabled in XCR0, execution of any instruction defined
for that feature causes an invalid-opcode exception (#UD).
As will be explained in Section 13.3, the XSAVE feature set is enabled only if CR4.OSXSAVE[bit 18] = 1. If
CR4.OSXSAVE = 0, the processor treats XSAVE-enabled state features and their state components as if all bits in
XCR0 were clear; the state components cannot be modified and the features’ instructions cannot be executed.
The state components for x87 state, for SSE state, for PT state, for PKRU state, for CET state, and for HDC state
are XSAVE-managed but the corresponding features are not XSAVE-enabled. Processors allow modification of this
state, as well as execution of x87 FPU instructions and SSE instructions and use of Intel Processor Trace, protection
keys, CET, and hardware duty cycling, regardless of the value of CR4.OSXSAVE and XCR0.
13-2 Vol. 1
MANAGING STATE USING THE XSAVE FEATURE SET
1. If CPUID.1:ECX.XSAVE[bit 26] = 1, XGETBV and XSETBV may be executed with ECX = 0 (to read and write XCR0). Any support for
execution of these instructions with other values of ECX is enumerated separately.
NOTE
In summary, the XSAVE feature set supports state component i (0 ≤ i < 63) if one of the following
is true: (1) i < 32 and CPUID.(EAX=0DH,ECX=0):EAX[i] = 1; (2) i ≥ 32 and
CPUID.(EAX=0DH,ECX=0):EDX[i–32] = 1; (3) i < 32 and CPUID.(EAX=0DH,ECX=1):ECX[i] = 1;
or (4) i ≥ 32 and CPUID.(EAX=0DH,ECX=1):EDX[i–32] = 1. The XSAVE feature set supports user
state component i if (1) or (2) holds; if (3) or (4) holds, state component i is a supervisor state
component and support is limited to XSAVES and XRSTORS.
— CPUID function 0DH, sub-function i (i > 1). This sub-function enumerates details for state component i. If
the XSAVE feature set supports state component i (see note above), the following items provide specific
details:
• EAX enumerates the size (in bytes) required for state component i.
• If state component i is a user state component, EBX enumerates the offset (in bytes, from the base of
the XSAVE area) of the section used for state component i. (This offset applies only when the standard
format for the extended region of the XSAVE area is being used; see Section 13.4.3.)
• If state component i is a supervisor state component, EBX returns 0.
• If state component i is a user state component, ECX[0] returns 0; if state component i is a supervisor
state component, ECX[0] returns 1.
• The value returned by ECX[1] indicates the alignment of state component i when the compacted format
of the extended region of an XSAVE area is used (see Section 13.4.3). If ECX[1] returns 0, state
component i is located immediately following the preceding state component; if ECX[1] returns 1, state
component i is located on the next 64-byte boundary following the preceding state component.
• ECX[31:2] and EDX return 0.
If the XSAVE feature set does not support state component i, sub-function i returns 0 in EAX, EBX, ECX, and
EDX.
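The offset rule enumerated by ECX[1] can be illustrated with a sketch that computes compacted-format offsets. The per-component sizes and alignment flags below are hypothetical inputs; real code would populate them from CPUID function 0DH sub-leaves.

```c
#include <stdint.h>

/* The extended region begins at byte 576 of an XSAVE area
 * (512-byte legacy region + 64-byte XSAVE header). */
#define XSAVE_EXT_BASE 576u

/* Compute compacted-format offsets for state components 2..62 from
 * per-component sizes (CPUID.(EAX=0DH,ECX=i):EAX) and 64-byte-alignment
 * flags (ECX[1]). Components absent from XCOMP_BV get offset 0. */
void compacted_offsets(const uint32_t size[63], const uint8_t align64[63],
                       uint64_t xcomp_bv, uint32_t offset[63])
{
    uint32_t next = XSAVE_EXT_BASE;
    for (int i = 2; i <= 62; i++) {
        offset[i] = 0;
        if (!((xcomp_bv >> i) & 1))
            continue;                    /* not in the compacted image */
        if (align64[i])
            next = (next + 63u) & ~63u;  /* next 64-byte boundary */
        offset[i] = next;
        next += size[i];
    }
}
```

With a hypothetical 100-byte component 3 followed by a 64-byte-aligned component 4, component 4 would land at the first 64-byte boundary after byte 676.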
Components” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for how system
software can avoid this penalty.
• XCR0[4:3] are associated with MPX state (see Section 13.5.4). Software can use the XSAVE feature set to
manage MPX state only if XCR0[4:3] = 11b. In addition, software can execute MPX instructions only if
CR4.OSXSAVE = 1 and XCR0[4:3] = 11b. Otherwise, any execution of an MPX instruction causes an invalid-
opcode exception (#UD).1
XCR0[4:3] have value 00b coming out of RESET. As noted in Section 13.2, a processor allows software to set
XCR0[4:3] to 11b if and only if CPUID.(EAX=0DH,ECX=0):EAX[4:3] = 11b. In addition, executing the XSETBV
instruction causes a general-protection fault (#GP) if ECX = 0 and EAX[4:3] is neither 00b nor 11b; that is,
software can enable the XSAVE feature set for MPX state only if it does so for both state components.
As noted in Section 13.1, the processor will preserve MPX state unmodified if software clears XCR0[4:3].
• XCR0[7:5] are associated with AVX-512 state (see Section 13.5.5). Software can use the XSAVE feature set to
manage AVX-512 state only if XCR0[7:5] = 111b. In addition, software can execute AVX-512 instructions only
if CR4.OSXSAVE = 1 and XCR0[7:5] = 111b. Otherwise, any execution of an AVX-512 instruction causes an
invalid-opcode exception (#UD).
XCR0[7:5] have value 000b coming out of RESET. As noted in Section 13.2, a processor allows software to set
XCR0[7:5] to 111b if and only if CPUID.(EAX=0DH,ECX=0):EAX[7:5] = 111b. In addition, executing the
XSETBV instruction causes a general-protection fault (#GP) if ECX = 0, EAX[7:5] is not 000b, and any bit is
clear in EAX[2:1] or EAX[7:5]; that is, software can enable the XSAVE feature set for AVX-512 state only if it
does so for all three state components, and only if it also does so for AVX state and SSE state. This implies that
the value of XCR0[7:5] is always either 000b or 111b.
As noted in Section 13.1, the processor will preserve AVX-512 state unmodified if software clears XCR0[7:5].
However, clearing XCR0[7:5] while AVX-512 state is not in its initial configuration may cause SSE and AVX
instructions to incur a power and performance penalty. See Section 13.5.3, “Enable the Use Of XSAVE Feature
Set And XSAVE State Components” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume
3A, for how system software can avoid this penalty.
• XCR0[9] is associated with PKRU state (see Section 13.5.7). Software can use the XSAVE feature set to
manage PKRU state only if XCR0[9] = 1. The value of XCR0[9] in no way determines whether software can use
protection keys or execute other instructions that access PKRU state (these instructions can be executed even
if XCR0[9] = 0).
XCR0[9] is 0 coming out of RESET. As noted in Section 13.2, a processor allows software to set XCR0[9] if and
only if CPUID.(EAX=0DH,ECX=0):EAX[9] = 1.
• XCR0[63:10] and XCR0[8] are reserved.2 Executing the XSETBV instruction causes a general-protection fault
(#GP) if ECX = 0 and any corresponding bit in EDX:EAX is not 0. These bits in XCR0 are all 0 coming out of
RESET.
Software operating with CPL > 0 may need to determine whether the XSAVE feature set and certain XSAVE-
enabled features have been enabled. If CPL > 0, execution of the MOV from CR4 instruction causes a general-
protection fault (#GP). The following alternative mechanisms allow software to discover the enabling of the XSAVE
feature set regardless of CPL:
• The value of CR4.OSXSAVE is returned in CPUID.1:ECX.OSXSAVE[bit 27]. If software determines that
CPUID.1:ECX.OSXSAVE = 1, the processor supports the XSAVE feature set and the feature set has been
enabled in CR4.
• Executing the XGETBV instruction with ECX = 0 returns the value of XCR0 in EDX:EAX. XGETBV can be
executed if CR4.OSXSAVE = 1 (if CPUID.1:ECX.OSXSAVE = 1), regardless of CPL.
Thus, software can use the following algorithm to determine the support and enabling for the XSAVE feature set:
1. Use CPUID to discover the value of CPUID.1:ECX.OSXSAVE.
— If the bit is 0, either the XSAVE feature set is not supported by the processor or has not been enabled by
software. Either way, the XSAVE feature set is not available, nor are XSAVE-enabled features such as AVX.
1. If XCR0[3] = 0, executions of CALL, RET, JMP, and Jcc do not initialize the bounds registers.
2. Bit 8 and bits 13:11 correspond to supervisor state components. Since bits can be set in XCR0 only for user state components,
those bits of XCR0 must be 0.
— If the bit is 1, the processor supports the XSAVE feature set — including the XGETBV instruction — and it
has been enabled by software. The XSAVE feature set can be used to manage x87 state (because XCR0[0]
is always 1). Software requiring more detailed information can go on to the next step.
2. Execute XGETBV with ECX = 0 to discover the value of XCR0. If XCR0[1] = 1, the XSAVE feature set can be
used to manage SSE state. If XCR0[2] = 1, the XSAVE feature set can be used to manage AVX state and
software can execute AVX instructions. If XCR0[4:3] is 11b, the XSAVE feature set can be used to manage MPX
state and software can execute MPX instructions. If XCR0[7:5] is 111b, the XSAVE feature set can be used to
manage AVX-512 state and software can execute AVX-512 instructions. If XCR0[9] = 1, the XSAVE feature set
can be used to manage PKRU state.
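The step-2 decoding above reduces to bit tests on the XCR0 value. This sketch operates on a supplied value; the function names are illustrative, and real code would obtain the value by executing XGETBV with ECX = 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode an XCR0 value per the algorithm above. */
static bool xcr0_sse(uint64_t xcr0)    { return (xcr0 >> 1) & 1; }
static bool xcr0_avx(uint64_t xcr0)    { return (xcr0 >> 2) & 1; }
static bool xcr0_mpx(uint64_t xcr0)    { return ((xcr0 >> 3) & 3) == 3; }  /* bits 4:3 = 11b  */
static bool xcr0_avx512(uint64_t xcr0) { return ((xcr0 >> 5) & 7) == 7; }  /* bits 7:5 = 111b */
static bool xcr0_pkru(uint64_t xcr0)   { return (xcr0 >> 9) & 1; }
```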
The IA32_XSS MSR (with MSR index DA0H) is zero coming out of RESET. If CR4.OSXSAVE = 1,
CPUID.(EAX=0DH,ECX=1):EAX[3] = 1, and CPL = 0, executing the WRMSR instruction with ECX = DA0H writes
the 64-bit value in EDX:EAX to the IA32_XSS MSR (EAX is written to IA32_XSS[31:0] and EDX to
IA32_XSS[63:32]). The following items provide details regarding individual bits in the IA32_XSS MSR:
• IA32_XSS[8] is associated with PT state (see Section 13.5.6). Software can use XSAVES and XRSTORS to
manage PT state only if IA32_XSS[8] = 1. The value of IA32_XSS[8] does not determine whether software can
use Intel Processor Trace (the feature can be used even if IA32_XSS[8] = 0).
• IA32_XSS[12:11] are associated with CET state (see Section 13.5.8), IA32_XSS[11] with CET_U state and
IA32_XSS[12] with CET_S state. Software can use XSAVES and XRSTORS to manage CET_U state (respec-
tively, CET_S state) only if IA32_XSS[11] = 1 (respectively, IA32_XSS[12] = 1). The value of
IA32_XSS[12:11] does not determine whether software can use CET (the feature can be used even if either of
IA32_XSS[12:11] is 0).
• IA32_XSS[13] is associated with HDC state (see Section 13.5.9). Software can use XSAVES and XRSTORS to
manage HDC state only if IA32_XSS[13] = 1. The value of IA32_XSS[13] does not determine whether software
can use hardware duty cycling (the feature can be used even if IA32_XSS[13] = 0).
• IA32_XSS[63:14], IA32_XSS[10:9] and IA32_XSS[7:0] are reserved.1 Executing the WRMSR instruction
causes a general-protection fault (#GP) if ECX = DA0H and any corresponding bit in EDX:EAX is not 0. These
bits in the IA32_XSS MSR are all 0 coming out of RESET.
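The reserved-bit rule can be sketched as a mask check. The macro name is illustrative; the valid mask reflects the supervisor state components defined above (bit 8 for PT state, bits 13:11 for CET_U, CET_S, and HDC state).

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits of IA32_XSS that may legally be set (supervisor components). */
#define IA32_XSS_VALID ((1ull << 8) | (1ull << 11) | (1ull << 12) | (1ull << 13))

/* Sketch of the #GP check for WRMSR with ECX = DA0H: any set bit in
 * EDX:EAX outside the valid mask would fault. */
static bool wrmsr_xss_would_gp(uint64_t edx_eax)
{
    return (edx_eax & ~IA32_XSS_VALID) != 0;
}
```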
There is no mechanism by which software operating with CPL > 0 can discover the value of the IA32_XSS MSR.
1. Bit 9 and bits 7:0 correspond to user state components. Since bits can be set in the IA32_XSS MSR only for supervisor state compo-
nents, those bits of the MSR must be 0.
The legacy region of an XSAVE area is laid out as follows (byte offset of each 16-byte row at left):

0        FCW, FSW, FTW, reserved, FOP, FIP[31:0], FIP[63:48] or FCS/reserved with FIP[47:32]
16       FDP[31:0], FDP[47:32], FDP[63:48] or FDS/reserved, MXCSR, MXCSR_MASK
32–144   ST0/MM0 through ST7/MM7 (one register per 16-byte row; the upper bytes of each row are reserved)
160–400  XMM0 through XMM15 (one register per 16-byte row)
The x87 state component comprises bytes 23:0 and bytes 159:32. The SSE state component comprises
bytes 31:24 and bytes 415:160. The XSAVE feature set does not use bytes 511:416; bytes 463:416 are reserved.
Section 13.7 through Section 13.9 provide details of how instructions in the XSAVE feature set use the legacy
region of an XSAVE area.
1. While MXCSR and MXCSR_MASK are part of SSE state, their treatment by the XSAVE feature set is not the same as that of the XMM
registers. See Section 13.7 through Section 13.11 for details.
As noted in Section 13.1, the XSAVE feature set manages MPX state as state components 3–4. Thus, MPX state is
located in the extended region of the XSAVE area (see Section 13.4.3). The following items detail how these state
components are organized in this region:
• BNDREGS state.
As noted in Section 13.2, CPUID.(EAX=0DH,ECX=3):EBX enumerates the offset (in bytes, from the base of the
XSAVE area) of the section of the extended region of the XSAVE area used for BNDREGS state (when the
standard format of the extended region is used). CPUID.(EAX=0DH,ECX=3):EAX enumerates the size (in
bytes) required for BNDREGS state. The BNDREGS section is used for the 4 128-bit bound registers BND0–
BND3, with bytes 16i+15:16i being used for BNDi.
• BNDCSR state.
As noted in Section 13.2, CPUID.(EAX=0DH,ECX=4):EBX enumerates the offset of the section of the extended
region of the XSAVE area used for BNDCSR state (when the standard format of the extended region is used).
CPUID.(EAX=0DH,ECX=4):EAX enumerates the size (in bytes) required for BNDCSR state. In the BNDSCR
section, bytes 7:0 are used for BNDCFGU and bytes 15:8 are used for BNDSTATUS.
Both components of MPX state are XSAVE-managed and the MPX feature is XSAVE-enabled. The XSAVE feature set
can operate on MPX state only if the feature set is enabled (CR4.OSXSAVE = 1) and has been configured to manage
MPX state (XCR0[4:3] = 11b). MPX instructions cannot be used unless the XSAVE feature set is enabled and has
been configured to manage MPX state.
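The layout above can be captured in a small sketch. The names are illustrative, and the section base offsets would come from CPUID sub-leaves 3 and 4 rather than being hard-coded.

```c
#include <stdint.h>

/* BNDi occupies bytes 16i+15:16i of the BNDREGS section. */
static uint32_t bndreg_offset(unsigned i)
{
    return 16u * i;
}

/* Within the BNDCSR section: bytes 7:0 hold BNDCFGU,
 * bytes 15:8 hold BNDSTATUS. */
#define BNDCFGU_OFFSET   0u
#define BNDSTATUS_OFFSET 8u
```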
The XSAVE feature set accesses Hi16_ZMM state only in 64-bit mode. Executions of XSAVE, XSAVEOPT,
XSAVEC, and XSAVES outside 64-bit mode do not modify the Hi16_ZMM section; executions of XRSTOR and
XRSTORS outside 64-bit mode do not update ZMM16–ZMM31. See Section 13.13. In general,
bytes 64(i-16)+63:64(i-16) are used for ZMMi (for 16 ≤ i ≤ 31).
All three components of AVX-512 state are XSAVE-managed and the AVX-512 feature is XSAVE-enabled. The
XSAVE feature set can operate on AVX-512 state only if the feature set is enabled (CR4.OSXSAVE = 1) and has
been configured to manage AVX-512 state (XCR0[7:5] = 111b). AVX-512 instructions cannot be used unless the
XSAVE feature set is enabled and has been configured to manage AVX-512 state.
13.5.6 PT State
The register state used by Intel Processor Trace (PT state) comprises the following 9 MSRs: IA32_RTIT_CTL,
IA32_RTIT_OUTPUT_BASE, IA32_RTIT_OUTPUT_MASK_PTRS, IA32_RTIT_STATUS, IA32_RTIT_CR3_MATCH,
IA32_RTIT_ADDR0_A, IA32_RTIT_ADDR0_B, IA32_RTIT_ADDR1_A, and IA32_RTIT_ADDR1_B.1
As noted in Section 13.1, the XSAVE feature set manages PT state as supervisor state component 8. Thus, PT state
is located in the extended region of the XSAVE area (see Section 13.4.3). As noted in Section 13.2,
CPUID.(EAX=0DH,ECX=8):EAX enumerates the size (in bytes) required for PT state. The MSRs are each allocated
8 bytes in the state component in the order given above. Thus, IA32_RTIT_CTL is at byte offset 0,
IA32_RTIT_OUTPUT_BASE at byte offset 8, etc. Any locations in the state component at or beyond byte offset 72
are reserved.
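The per-MSR offsets follow directly from the ordering above. This lookup sketch is illustrative and assumes exactly the nine MSRs listed, each allocated 8 bytes.

```c
#include <string.h>

/* Order of the 9 RTIT MSRs within the PT state component; offsets at
 * or beyond byte 72 are reserved. */
static const char *const pt_msrs[9] = {
    "IA32_RTIT_CTL", "IA32_RTIT_OUTPUT_BASE", "IA32_RTIT_OUTPUT_MASK_PTRS",
    "IA32_RTIT_STATUS", "IA32_RTIT_CR3_MATCH",
    "IA32_RTIT_ADDR0_A", "IA32_RTIT_ADDR0_B",
    "IA32_RTIT_ADDR1_A", "IA32_RTIT_ADDR1_B",
};

/* Byte offset of an MSR within the PT state component; -1 if the MSR
 * is not part of PT state. */
static int pt_msr_offset(const char *name)
{
    for (int i = 0; i < 9; i++)
        if (strcmp(pt_msrs[i], name) == 0)
            return 8 * i;
    return -1;
}
```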
PT state is XSAVE-managed but Intel Processor Trace is not XSAVE-enabled. The XSAVE feature set can operate on
PT state only if the feature set is enabled (CR4.OSXSAVE = 1) and has been configured to manage PT state
(IA32_XSS[8] = 1). Software can otherwise use Intel Processor Trace and access its MSRs (using RDMSR and
WRMSR) even if the XSAVE feature set is not enabled or has not been configured to manage PT state.
The following items describe special treatment of PT state by the XSAVES and XRSTORS instructions:
• If XSAVES saves PT state, the instruction clears IA32_RTIT_CTL.TraceEn (bit 0) after saving the value of the
IA32_RTIT_CTL MSR and before saving any other PT state. If XSAVES causes a fault or a VM exit, it restores
IA32_RTIT_CTL.TraceEn to its original value.
• If XSAVES saves PT state, the instruction saves zeroes in the reserved portions of the state component.
• If XRSTORS would restore (or initialize) PT state and IA32_RTIT_CTL.TraceEn = 1, the instruction causes a
general-protection exception (#GP) before modifying PT state.
• If XRSTORS causes an exception or a VM exit, it does so before any modification to IA32_RTIT_CTL.TraceEn
(even if it has loaded other PT state).
1. These MSRs might not be supported by every processor that supports Intel Processor Trace. Software can use the CPUID instruction
to discover which are supported; see Section 35.3.1, “Detection of Intel Processor Trace and Capability Enumeration,” of Intel® 64
and IA-32 Architectures Software Developer’s Manual, Volume 3C.
The value of the PKRU register determines the access rights for user-mode linear addresses. (See Section 4.6,
“Access Rights,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.) The access rights
that pertain to an execution of the XRSTOR and XRSTORS instructions are determined by the value of the register
before the execution and not by any value that the execution might load into the PKRU register.
1. The IA32_S_CET and IA32_INTERRUPT_SSP_TABLE_ADDR MSRs also control CET when CPL < 3. However, they are not managed by
the XSAVE feature set and are thus not considered in this chapter.
A processor can support the init and modified optimizations with special hardware that tracks the state components
that might benefit from those optimizations. Other implementations might not include such hardware; such a
processor would always consider each such state component as not in its initial configuration and as modified since
the last execution of XRSTOR or XRSTORS.
The following notation describes the state of the init and modified optimizations:
• XINUSE denotes the state-component bitmap corresponding to the init optimization. If XINUSE[i] = 0, state
component i is known to be in its initial configuration; otherwise XINUSE[i] = 1. It is possible for XINUSE[i] to
be 1 even when state component i is in its initial configuration. On a processor that does not support the init
optimization, XINUSE[i] is always 1 for every value of i.
Executing XGETBV with ECX = 1 returns in EDX:EAX the logical-AND of XCR0 and the current value of the
XINUSE state-component bitmap. Such an execution of XGETBV always sets EAX[1] to 1 if XCR0[1] = 1 and
MXCSR does not have its RESET value of 1F80H. Section 13.2 explains how software can determine whether a
processor supports this use of XGETBV.
• XMODIFIED denotes the state-component bitmap corresponding to the modified optimization. If
XMODIFIED[i] = 0, state component i is known not to have been modified since the most recent execution of
XRSTOR or XRSTORS; otherwise XMODIFIED[i] = 1. It is possible for XMODIFIED[i] to be 1 even when state
component i has not been modified since the most recent execution of XRSTOR or XRSTORS. On a processor
that does not support the modified optimization, XMODIFIED[i] is always 1 for every value of i.
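Under these definitions, the value returned by XGETBV with ECX = 1 (including the MXCSR special case noted above) can be sketched as follows; the inputs are illustrative values supplied by the caller, not read from hardware.

```c
#include <stdint.h>

#define MXCSR_RESET 0x1F80u

/* Sketch of XGETBV with ECX = 1: XCR0 ANDed with XINUSE, except that
 * bit 1 is forced to 1 when XCR0[1] = 1 and MXCSR differs from its
 * RESET value of 1F80H. */
static uint64_t xgetbv1(uint64_t xcr0, uint64_t xinuse, uint32_t mxcsr)
{
    uint64_t v = xcr0 & xinuse;
    if ((xcr0 & 2) && mxcsr != MXCSR_RESET)
        v |= 2;   /* SSE state reported in use because of MXCSR */
    return v;
}
```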
A processor that implements the modified optimization saves information about the most recent execution of
XRSTOR or XRSTORS in a quantity called XRSTOR_INFO, a 4-tuple containing the following: (1) the CPL;
(2) whether the logical processor was in VMX non-root operation; (3) the linear address of the XSAVE area; and
(4) the XCOMP_BV field in the XSAVE area. An execution of XSAVEOPT or XSAVES uses the modified optimization
only if that execution corresponds to XRSTOR_INFO on these four parameters.
This mechanism implies that, depending on details of the operating system, the processor might determine that an
execution of XSAVEOPT by one user application corresponds to an earlier execution of XRSTOR by a different appli-
cation. For this reason, Intel recommends that application software not use the XSAVEOPT instruction.
The following items specify the initial configuration each state component (for the purposes of defining the XINUSE
bitmap):
• x87 state. x87 state is in its initial configuration if the following all hold: FCW is 037FH; FSW is 0000H; FTW is
FFFFH; FCS and FDS are each 0000H; FIP and FDP are each 00000000_00000000H; each of ST0–ST7 is
0000_00000000_00000000H.
• SSE state. In 64-bit mode, SSE state is in its initial configuration if each of XMM0–XMM15 is 0. Outside 64-bit
mode, SSE state is in its initial configuration if each of XMM0–XMM7 is 0. XINUSE[1] pertains only to the state
of the XMM registers and not to MXCSR. An execution of XRSTOR or XRSTORS outside 64-bit mode does not
update XMM8–XMM15. (See Section 13.13.)
• AVX state. In 64-bit mode, AVX state is in its initial configuration if each of YMM0_H–YMM15_H is 0. Outside
64-bit mode, AVX state is in its initial configuration if each of YMM0_H–YMM7_H is 0. An execution of XRSTOR
or XRSTORS outside 64-bit mode does not update YMM8_H–YMM15_H. (See Section 13.13.)
• BNDREGS state. BNDREGS state is in its initial configuration if the value of each of BND0–BND3 is 0.
• BNDCSR state. BNDCSR state is in its initial configuration if BNDCFGU and BNDSTATUS each has value 0.
• Opmask state. Opmask state is in its initial configuration if each of the opmask registers k0–k7 is 0.
• ZMM_Hi256 state. In 64-bit mode, ZMM_Hi256 state is in its initial configuration if each of ZMM0_H–
ZMM15_H is 0. Outside 64-bit mode, ZMM_Hi256 state is in its initial configuration if each of ZMM0_H–ZMM7_H
is 0. An execution of XRSTOR or XRSTORS outside 64-bit mode does not update ZMM8_H–ZMM15_H. (See
Section 13.13.)
• Hi16_ZMM state. In 64-bit mode, Hi16_ZMM state is in its initial configuration if each of ZMM16–ZMM31 is 0.
Outside 64-bit mode, Hi16_ZMM state is always in its initial configuration. An execution of XRSTOR or XRSTORS
outside 64-bit mode does not update ZMM16–ZMM31. (See Section 13.13.)
• PT state. PT state is in its initial configuration if each of the 9 MSRs is 0.
• PKRU state. PKRU state is in its initial configuration if the value of the PKRU is 0.
• CET_U state. CET_U state is in its initial configuration if both of the MSRs are 0.
• CET_S state. CET_S state is in its initial configuration if each of the three MSRs is 0.
• HDC state. HDC state is in its initial configuration if the value of the IA32_PM_CTL1 MSR is 1.
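As one example of testing an initial configuration, the x87 conditions listed above can be checked as follows. The struct is an illustrative container for the register values, not the in-memory XSAVE layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative container for x87 register values. */
struct x87_regs {
    uint16_t fcw, fsw, ftw, fcs, fds;
    uint64_t fip, fdp;
    uint8_t  st[8][10];   /* ST0-ST7, 80 bits each */
};

/* x87 state is in its initial configuration if FCW = 037FH, FSW = 0,
 * FTW = FFFFH, FCS = FDS = 0, FIP = FDP = 0, and ST0-ST7 are all 0. */
static bool x87_is_init(const struct x87_regs *r)
{
    if (r->fcw != 0x037F || r->fsw != 0 || r->ftw != 0xFFFF)
        return false;
    if (r->fcs != 0 || r->fds != 0 || r->fip != 0 || r->fdp != 0)
        return false;
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 10; j++)
            if (r->st[i][j] != 0)
                return false;
    return true;
}
```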
1. If CR0.AM = 1, CPL = 3, and EFLAGS.AC =1, an alignment-check exception (#AC) may occur instead of #GP.
2. If the processor does not support the compacted form of XRSTOR, it may execute the standard form of XRSTOR without first read-
ing the XCOMP_BV field. A processor supports the compacted form of XRSTOR only if it enumerates
CPUID.(EAX=0DH,ECX=1):EAX[1] as 1.
3. Bytes 63:24 of the XSAVE header are also reserved. Software should ensure that bytes 63:16 of the XSAVE header are all 0 in any
XSAVE area. (Bytes 15:8 should also be 0 if the XSAVE area is to be used on a processor that does not support the compaction
extensions to the XSAVE feature set.)
If XSTATE_BV[1] = 0, the compacted form of XRSTOR initializes MXCSR to 1F80H. (This differs from the standard
form of XRSTOR, which loads MXCSR from the XSAVE area whenever either RFBM[1] or RFBM[2] is set.)
• If RFBM[i] = 1 and XSTATE_BV[i] = 0, state component i is set to its initial configuration as indicated above,
even if XCOMP_BV[i] = 0. This is true for all values of i, including 0 (x87 state) and 1 (SSE state).
• If XSTATE_BV[i] = 1, the state component is loaded with data from the XSAVE area.1 See Section 13.5 for
specifics for each state component and for details regarding mode-specific operation and operation determined
by instruction prefixes. See Section 13.13 for details regarding faults caused by memory accesses.
State components 0 and 1 are located in the legacy region of the XSAVE area (see Section 13.4.1). Each state
component i, 2 ≤ i ≤ 62, is located in the extended region; the compacted form of the XRSTOR instruction uses
the compacted format for the extended region (see Section 13.4.3).
The MXCSR register is part of SSE state (see Section 13.5.2) and is thus loaded from memory if RFBM[1] =
XSTATE_BV[1] = 1. The compacted form of XRSTOR does not consider RFBM[2] (AVX) when determining whether
to update MXCSR. (This is a difference from the standard form of XRSTOR.) The compacted form of XRSTOR causes
a general-protection exception (#GP) if it would load MXCSR with an illegal value.
1. Earlier fault checking ensured that, if the instruction has reached this point in execution and XSTATE_BV[i] is 1, then XCOMP_BV[i] is
also 1.
The following conditions cause execution of the XSAVEOPT instruction to generate a fault:
• If the XSAVE feature set is not enabled (CR4.OSXSAVE = 0), an invalid-opcode exception (#UD) occurs.
• If CR0.TS[bit 3] is 1, a device-not-available exception (#NM) occurs.
• If the address of the XSAVE area is not 64-byte aligned, a general-protection exception (#GP) occurs.1
If none of these conditions cause a fault, execution of XSAVEOPT reads the XSTATE_BV field of the XSAVE header
(see Section 13.4.2) and writes it back to memory, setting XSTATE_BV[i] (0 ≤ i ≤ 63) as follows:
• If RFBM[i] = 0, XSTATE_BV[i] is not changed.
• If RFBM[i] = 1, XSTATE_BV[i] is set to the value of XINUSE[i]. Section 13.6 defines XINUSE to describe the
processor init optimization and specifies the initial configuration of each state component. The nature of that
optimization implies the following:
— If the state component is in its initial configuration, XINUSE[i] may be either 0 or 1, and XSTATE_BV[i] may
be written with either 0 or 1.
XINUSE[1] pertains only to the state of the XMM registers and not to MXCSR. Thus, XSTATE_BV[1] may be
written with 0 even if MXCSR does not have its RESET value of 1F80H.
— If the state component is not in its initial configuration, XSTATE_BV[i] is written with 1.
(As explained in Section 13.6, the initial configurations of some state components may depend on whether the
processor is in 64-bit mode.)
The XSAVEOPT instruction does not write any part of the XSAVE header other than the XSTATE_BV field; in partic-
ular, it does not write to the XCOMP_BV field.
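The XSTATE_BV writeback rules reduce to simple bit arithmetic, sketched below (treating XINUSE as a concrete value, although a processor may report a component in its initial configuration as in use).

```c
#include <stdint.h>

/* Sketch of how XSAVEOPT rewrites XSTATE_BV: bits outside RFBM are
 * preserved; bits inside RFBM take the XINUSE value. */
static uint64_t xsaveopt_xstate_bv(uint64_t old_bv, uint64_t rfbm,
                                   uint64_t xinuse)
{
    return (old_bv & ~rfbm) | (xinuse & rfbm);
}
```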
Execution of XSAVEOPT saves into the XSAVE area those state components corresponding to bits that are set in
RFBM (subject to the optimizations described below). State components 0 and 1 are located in the legacy region of
the XSAVE area (see Section 13.4.1). Each state component i, 2 ≤ i ≤ 62, is located in the extended region; the
XSAVEOPT instruction always uses the standard format for the extended region (see Section 13.4.3).
See Section 13.5 for specifics for each state component and for details regarding mode-specific operation and
operation determined by instruction prefixes. See Section 13.13 for details regarding faults caused by memory
accesses.
Execution of XSAVEOPT performs two optimizations that reduce the amount of data written to memory:
• Init optimization.
If XINUSE[i] = 0, state component i is not saved to the XSAVE area (even if RFBM[i] = 1). (See below for
exceptions made for MXCSR.)
• Modified optimization.
Each execution of XRSTOR and XRSTORS establishes XRSTOR_INFO as a 4-tuple w,x,y,z (see Section 13.8.3
and Section 13.12). Execution of XSAVEOPT uses the modified optimization only if the following all hold for the
current value of XRSTOR_INFO:
— w = CPL;
— x = 1 if and only if the logical processor is in VMX non-root operation;
— y is the linear address of the XSAVE area being used by XSAVEOPT; and
— z is 00000000_00000000H. (This last item implies that XSAVEOPT does not use the modified optimization
if the last execution of XRSTOR used the compacted form, or if an execution of XRSTORS followed the last
execution of XRSTOR.)
If XSAVEOPT uses the modified optimization and XMODIFIED[i] = 0 (see Section 13.6), state component i is
not saved to the XSAVE area.
(In practice, the benefit of the modified optimization for state component i depends on how the processor is
tracking state component i; see Section 13.6. Limitations on the tracking ability may result in state component
i being saved even though it is in the same configuration that was loaded by the previous execution of XRSTOR.)
1. If CR0.AM = 1, CPL = 3, and EFLAGS.AC =1, an alignment-check exception (#AC) may occur instead of #GP.
Depending on details of the operating system, an execution of XSAVEOPT by a user application might use the
modified optimization when the most recent execution of XRSTOR was by a different application. Because of
this, Intel recommends that application software not use the XSAVEOPT instruction.
The MXCSR register and MXCSR_MASK are part of SSE state (see Section 13.5.2) and are thus associated with
bit 1 of RFBM. However, the XSAVEOPT instruction also saves these values when RFBM[2] = 1 (even if RFBM[1] =
0). The init and modified optimizations do not apply to the MXCSR register and MXCSR_MASK.
1. If CR0.AM = 1, CPL = 3, and EFLAGS.AC =1, an alignment-check exception (#AC) may occur instead of #GP.
2. Unlike the XSAVE and XSAVEOPT instructions, the XSAVEC instruction does not read the XSTATE_BV field of the XSAVE header.
registers) even if XINUSE[1] = 0. Unlike with the XSAVE instruction, RFBM[2] does not determine whether XSAVEC
saves MXCSR and MXCSR_MASK.
• y is the linear address of the XSAVE area being used by XSAVES; and
• z[63] is 1 and z[62:0] = RFBM[62:0]. (This last item implies that XSAVES does not use the modified optimi-
zation if the last execution of XRSTOR used the standard form and followed the last execution of XRSTORS.)
If XSAVES uses the modified optimization and XMODIFIED[i] = 0 (see Section 13.6), state component i is not
saved to the XSAVE area.
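The two modified-optimization conditions (for XSAVEOPT, a recorded XCOMP_BV of zero; for XSAVES, bit 63 set with bits 62:0 equal to RFBM) can be sketched as tuple comparisons. The struct is an illustrative model of XRSTOR_INFO, not an architectural data structure.

```c
#include <stdbool.h>
#include <stdint.h>

/* XRSTOR_INFO: CPL, VMX non-root flag, XSAVE-area linear address, and
 * the XCOMP_BV recorded by the last XRSTOR/XRSTORS. */
struct xrstor_info {
    uint8_t  cpl;
    bool     vmx_nonroot;
    uint64_t addr;
    uint64_t xcomp_bv;
};

/* XSAVEOPT uses the modified optimization only if the tuple matches and
 * the recorded XCOMP_BV is zero (last restore used the standard form). */
static bool xsaveopt_modified_ok(const struct xrstor_info *x, uint8_t cpl,
                                 bool vmx, uint64_t addr)
{
    return x->cpl == cpl && x->vmx_nonroot == vmx &&
           x->addr == addr && x->xcomp_bv == 0;
}

/* XSAVES requires z[63] = 1 and z[62:0] = RFBM[62:0]. */
static bool xsaves_modified_ok(const struct xrstor_info *x, uint8_t cpl,
                               bool vmx, uint64_t addr, uint64_t rfbm)
{
    return x->cpl == cpl && x->vmx_nonroot == vmx && x->addr == addr &&
           x->xcomp_bv == ((1ull << 63) | (rfbm & ~(1ull << 63)));
}
```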
1. Earlier fault checking ensured that, if the instruction has reached this point in execution and XSTATE_BV[i] is 1, then XCOMP_BV[i] is
also 1.
If an execution of XRSTORS causes an exception or a VM exit during or after restoring a supervisor state compo-
nent, each element of that state component may have the value it held before the XRSTORS execution, the value
loaded from the XSAVE area, or the element’s initial value (as defined in Section 13.6). See Section 13.5.6 for some
special treatment of PT state for the case in which XRSTORS causes an exception or a VM exit.
Like XRSTOR, execution of XRSTORS causes the processor to update its tracking for the init and modified optimizations (see Section 13.6 and Section 13.8.3). The following items provide details:
• The processor updates its tracking for the init optimization as follows:
— If RFBM[i] = 0, XINUSE[i] is not changed.
— If RFBM[i] = 1 and XSTATE_BV[i] = 0, state component i may be tracked as init; XINUSE[i] may be set to
0 or 1.
— If RFBM[i] = 1 and XSTATE_BV[i] = 1, state component i is tracked as not init; XINUSE[i] is set to 1.
• The processor updates its tracking for the modified optimization and records information about the XRSTORS
execution for future interaction with the XSAVEOPT and XSAVES instructions as follows:
— If RFBM[i] = 0, state component i is tracked as modified; XMODIFIED[i] is set to 1.
— If RFBM[i] = 1, state component i may be tracked as unmodified; XMODIFIED[i] may be set to 0 or 1.
— XRSTOR_INFO is set to the 4-tuple w,x,y,z, where w is the CPL; x is 1 if the logical processor is in VMX
non-root operation and 0 otherwise; y is the linear address of the XSAVE area; and z is XCOMP_BV (this
implies that z[63] = 1).
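The tracking updates itemized above can be sketched as follows. Where the manual allows either outcome ("may be set to 0 or 1"), this model picks one deterministic choice; real processors may behave differently. The function name and bitmask representation are illustrative:

```python
def xrstors_update_tracking(rfbm, xstate_bv, xinuse, xmodified, n=63):
    """Model the XINUSE/XMODIFIED updates performed by XRSTORS."""
    for i in range(n):
        bit = 1 << i
        if rfbm & bit == 0:
            xmodified |= bit       # state component i tracked as modified
            continue               # XINUSE[i] is not changed
        if xstate_bv & bit:
            xinuse |= bit          # tracked as not init
        else:
            xinuse &= ~bit         # "may be 0 or 1"; this model picks 0
        xmodified &= ~bit          # "may be 0 or 1"; this model picks 0
    return xinuse, xmodified
```

For example, with RFBM = 011B and XSTATE_BV = 001B the model marks component 0 as not init, component 1 as init, and component 2 (outside RFBM) as modified.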
4. New Chapter 18, Volume 1
New Chapter 18 added to the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic
Architecture.
------------------------------------------------------------------------------------------
18.1 INTRODUCTION
Return-oriented programming (ROP) and, similarly, CALL/JMP-oriented programming (COP/JOP) have been the
prevalent attack methodologies for stealth exploit writers targeting vulnerabilities in programs. These attack
methodologies share the following common elements:
• A code module with execution privilege that contains small snippets of code sequences with the characteristic
that at least one instruction in the sequence is a control transfer instruction that depends on data either in the
return stack or in a register for the target address.
• Diverting the control flow instruction (e.g., RET, CALL, JMP) from its original target address to a new target (via
modification in the data stack or in the register).
Control-Flow Enforcement Technology (CET) provides the following capabilities to defend against ROP/COP/JOP
style control-flow subversion attacks:
• Shadow stack: Return address protection to defend against ROP.
• Indirect branch tracking: Free branch protection to defend against COP/JOP.
Both capabilities introduce new instruction set extensions, and are described in the Intel® 64 and IA-32 Architec-
tures Software Developer’s Manual, Volumes 2A, 2B, 2C & 2D.
Control-Flow Enforcement Technology introduces a new exception (#CP) with interrupt vector 21.
CONTROL-FLOW ENFORCEMENT TECHNOLOGY (CET)
stream must be an ENDBRANCH. If the next instruction is not an ENDBRANCH, the processor causes a control
protection exception (#CP); otherwise, the state machine moves back to IDLE state.
18.2.1 Shadow Stack Pointer and its Operand and Address Size Attributes
When the processor supports the shadow stack feature and CET is enabled, the processor supports a new
architectural register: the shadow stack pointer (SSP). The SSP cannot be directly encoded as a source,
destination, or memory operand in instructions. The SSP points to the current top of the shadow stack.
The width of the shadow stack is 32-bit in 32-bit/compatibility mode and is 64-bit in 64-bit mode. The address-size
attribute of the shadow stack is likewise 32-bit in 32-bit/compatibility mode and 64-bit in 64-bit mode.
18.2.2 Terminology
When shadow stacks are enabled, certain control transfer instructions/flows and shadow stack management
instructions perform loads/stores to the shadow stack. Such loads/stores from control transfer instructions and
shadow stack management instructions are termed shadow_stack_load and shadow_stack_store to distinguish
them from loads/stores performed by other instructions like MOV, XSAVES, etc.
The pseudocode for the instruction operations uses the notation ShadowStackEnabled(CPL) as a test of whether
shadow stacks are enabled at the CPL. This term returns a TRUE or FALSE indication as follows.
ShadowStackEnabled(CPL):
    IF CR4.CET = 1 AND CR0.PE = 1 AND EFLAGS.VM = 0
        IF CPL = 3
            THEN
                (* Obtain the shadow stack enable from the IA32_U_CET MSR (MSR address 6A0H) used to enable the feature for CPL = 3 *)
                SHADOW_STACK_ENABLED = IA32_U_CET.SH_STK_EN;
            ELSE
                (* Obtain the shadow stack enable from the IA32_S_CET MSR (MSR address 6A2H) used to enable the feature for CPL < 3 *)
                SHADOW_STACK_ENABLED = IA32_S_CET.SH_STK_EN;
        FI;
        IF SHADOW_STACK_ENABLED = 1
            THEN
                return TRUE;
            ELSE
                return FALSE;
        FI;
    ELSE
        (* Shadow stacks are not enabled in real mode and virtual-8086 mode or if the master CET feature enable in CR4 is disabled *)
        return FALSE;
    ENDIF
Figure 18-1. Supervisor Shadow Stack with a Supervisor Shadow Stack Token
The WRMSR instruction requires the address specified in the IA32_PLx_SSP MSR (where 0 ≤ x ≤ 3) to be 8-byte
aligned; otherwise, the instruction causes a general-protection exception (#GP(0)). The processor
does the following checks prior to switching to a supervisor shadow stack programmed into the IA32_PLx_SSP
MSR. These steps are performed atomically.
1. Load the supervisor shadow stack token from the address specified in the IA32_PLx_SSP MSR using a
shadow_stack_load.
2. Check if the busy bit in the token is 0; reserved bits must be 0.
3. Check if the address programmed in the MSR matches the address in the supervisor shadow stack token;
reserved bits must be 0.
4. If checks 2 and 3 are successful, then set the busy bit in the token using a shadow_stack_store and switch the
SSP to the value specified in the IA32_PLx_SSP MSR.
5. If checks 2 or 3 fail, then the busy bit is not set and a #GP(0) exception is raised.
On a far RET, the instruction clears the busy bit in the shadow stack token as follows. These steps are also
performed atomically.
1. Load the supervisor shadow stack token from the SSP using a shadow_stack_load.
2. Check if the busy bit in the token is 1; reserved bits must be 0.
3. Check if the address programmed in supervisor shadow stack token matches SSP; reserved bits must be 0.
4. If checks 2 and 3 are successful, then clear the busy bit in the token using a shadow_stack_store; else continue
without modifying the contents of the shadow stack pointed to by SSP.
The operations described here are also applicable to a far transfer performed when calling an interrupt or exception
handler through an interrupt/trap gate in the IDT. Likewise, the IRET instruction behaves similarly to the far RET
instruction.
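The two atomic token sequences above can be sketched as follows. The bit layout (busy in bit 0, reserved bits 2:1, the 8-byte-aligned address in the upper bits) and the ValueError standing in for #GP(0) are modeling choices for illustration, not architectural definitions:

```python
BUSY = 1 << 0          # busy flag (illustrative placement)
RESERVED = 0b110       # reserved bits; must be 0

def set_busy(token: int, msr_ssp: int) -> int:
    """Model the checks done when switching to a supervisor shadow stack:
    on success return the token with busy set, else raise (#GP(0))."""
    if token & BUSY or token & RESERVED:
        raise ValueError("#GP(0): busy already set or reserved bits set")
    if (token & ~0x7) != msr_ssp:
        raise ValueError("#GP(0): token address does not match the MSR")
    return token | BUSY

def clear_busy(token: int, ssp: int) -> int:
    """Model the far-RET sequence: clear busy only if all checks pass;
    otherwise leave the token unmodified (no exception is raised)."""
    if token & BUSY and not (token & RESERVED) and (token & ~0x7) == ssp:
        return token & ~BUSY
    return token
```

In this model, set_busy(0xFF8, 0xFF8) yields the busy token 0xFF9 of Figure 18-1, and clear_busy(0xFF9, 0xFF8) restores 0xFF8.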
In this example, the initial SSP is 1000H and the shadow-stack-restore token is on a new shadow stack at address
3FF8H. The token at address 3FF8H holds the SSP when this restore point was created; in this example it is 4000H.
In order to switch to the new shadow stack, the RSTORSSP instruction is invoked with the memory operand
set to 3FF8H. When the RSTORSSP instruction completes, the SSP is set to 3FF8H and the shadow-stack-
restore token at 3FF8H is replaced by a previous-ssp token that holds the address 1000H, i.e., the old SSP.
The following figure illustrates the SAVEPREVSSP instruction operation during a shadow stack switching sequence.
[Figure: SAVEPREVSSP — before: the current active shadow stack has a previous-SSP token on top; after: a shadow-stack-restore token has been pushed on the previous shadow stack and the previous-SSP token has been popped off the current shadow stack.]
To allow switching back to this old shadow stack, a SAVEPREVSSP instruction is now invoked. The SAVEPREVSSP
instruction does not take any memory operand and expects to find a previous-ssp token at the top of the shadow
stack, i.e., at address 3FF8H. The SAVEPREVSSP instruction then saves a shadow-stack-restore token on the old
shadow stack at address FF8H, and the token itself holds the address 1000H which is the address recorded in the
previous-ssp token. The SAVEPREVSSP instruction also pops the previous-ssp token off the current shadow stack
and thus the SSP following SAVEPREVSSP is 4000H.
Subsequently to switch back to the old shadow stack, a RSTORSSP instruction may be invoked with memory
operand set to FF8H.
If, following a switch to a new shadow stack, it is not required to create a restore point on the old shadow stack,
then the previous-ssp token created by the RSTORSSP instruction can be popped off the shadow stack by using the
INCSSP instruction.
See the SAVEPREVSSP and RSTORSSP instruction operations for the detailed algorithm.
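The token transformations in the example above can be modeled with a small sketch. Shadow stack memory is modeled as a dict of 8-byte slots and tokens as tagged tuples rather than their architectural bit encodings; all names are illustrative:

```python
def rstorssp(mem, ssp, operand):
    """Model RSTORSSP: verify a shadow-stack-restore token at the operand
    address, replace it with a previous-ssp token holding the old SSP,
    and switch SSP to the operand address."""
    kind, _created_ssp = mem[operand]
    assert kind == "restore"
    mem[operand] = ("previous_ssp", ssp)
    return operand                      # new SSP

def saveprevssp(mem, ssp):
    """Model SAVEPREVSSP: pop the previous-ssp token at the top of the
    shadow stack and save a restore token on the old shadow stack."""
    kind, prev = mem.pop(ssp)
    assert kind == "previous_ssp"
    mem[prev - 8] = ("restore", prev)   # restore token holds the old SSP
    return ssp + 8                      # token popped off
```

Running the chapter's example (SSP = 1000H, restore token at 3FF8H), RSTORSSP moves the SSP to 3FF8H, and SAVEPREVSSP then leaves a restore token at FF8H holding 1000H with the SSP at 4000H.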
[Table: classes of events relevant to #CP(ENDBRANCH) prioritization — #CP fault, code breakpoint, hardware interrupts/probe, RESET/#MC.]
Higher priority faults/traps/events that occur at the end of an indirect CALL/JMP are delivered ahead of any
#CP(ENDBRANCH) fault. The CET state machine at the privilege level where the higher priority fault/trap/event
occurred retains its state when the control transfers to the fault/trap/event handler. The instruction pointer pushed
on the stack for a #CP(ENDBRANCH) fault is the address of the instruction at the target of the indirect CALL/JMP
that caused the fault.
18.3.2 Terminology
The pseudocode for the instruction operations uses the notation EndbranchEnabled(CPL) as a test of whether
indirect branch tracking is enabled at the CPL. This term returns a TRUE or FALSE indication as follows.
EndbranchEnabled(CPL):
    IF CR4.CET = 1 AND CR0.PE = 1 AND EFLAGS.VM = 0
        IF CPL = 3
            THEN
                (* Obtain the ENDBRANCH enable from the MSR used to enable the feature for CPL = 3 *)
                ENDBR_ENABLED = IA32_U_CET.ENDBR_EN;
            ELSE
                (* Obtain the ENDBRANCH enable from the MSR used to enable the feature for CPL < 3 *)
                ENDBR_ENABLED = IA32_S_CET.ENDBR_EN;
        FI;
        IF ENDBR_ENABLED = 1
            THEN
                return TRUE;
            ELSE
                return FALSE;
        FI;
    ELSE
        (* Indirect branch tracking is not enabled in real mode and virtual-8086 mode or if the master CET feature enable in CR4 is disabled *)
        return FALSE;
    ENDIF
occurs during execution of a user mode (CPL == 3) program and it causes the CPL to switch to supervisor mode
(CPL < 3) then, as part of the CPL change, the user mode CET indirect branch tracker becomes inactive and the
supervisor mode CET indirect branch tracker becomes active. A subsequent IRET is used by the interrupt handler
to return to the interrupted user mode program. This IRET causes the processor to switch the CPL to user mode
(CPL ==3) and, as part of the CPL change, the supervisor mode CET indirect branch tracker becomes inactive and
the user mode CET indirect branch tracker becomes active.
The CPL where the event or instruction that caused the control transfer occurs is termed the source CPL, and the
CET indirect branch tracker state at that CPL is referred to here as the source CET indirect branch tracker state. The
CPL reached at the end of the control transfer is termed the destination CPL, and the CET indirect branch tracker
state at that CPL is referred to as the destination CET indirect branch tracker state.
This section describes various cases of control transfers that occur between user mode (CPL 3) and supervisor
mode (CPL < 3).
In all these cases the source CET indirect branch tracker state becomes not active and retains its state (IDLE,
WAIT_FOR_ENDBRANCH), and the target CET indirect branch tracker state becomes active if there was no fault
during the transfer.
• Case 1: Far CALL/JMP, SYSCALL/SYSENTER
— If indirect branch tracking is enabled, the target indirect branch tracker state becomes active and is unsuppressed and goes to WAIT_FOR_ENDBRANCH. This enforces that the subroutine invoked by a far
CALL/JMP must begin with an ENDBRANCH.
• Case 2: Hardware interrupt/trap/exception/NMI/Software interrupt/Machine Checks
— If indirect branch tracking is enabled, the target indirect branch tracker state becomes active and is unsup-
pressed and goes to WAIT_FOR_ENDBRANCH.
• Case 3: IRET
— If indirect branch tracking is enabled, the target indirect branch tracker becomes active and keeps its state. If
the user mode was interrupted by a higher priority event, like an interrupt at the end of the indirect
CALL/JMP, then when an IRET is used to return to the interrupted user mode program, the user mode
indirect branch tracker retains its state and a #CP fault will occur if the next instruction decoded is not an
endbr32/64 according to mode of machine.
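The three cases above can be summarized with a toy state-handover model. Only the active/inactive handover and the retained tracker state are modeled; ENDBRANCH decoding, suppression, and #CP delivery are out of scope, and all names are illustrative:

```python
IDLE, WAIT = "IDLE", "WAIT_FOR_ENDBRANCH"

class Trackers:
    """Per-privilege-level CET indirect branch trackers."""
    def __init__(self):
        self.state = {"user": IDLE, "supervisor": IDLE}

    def far_transfer(self, target):
        # Cases 1 and 2: far CALL/JMP, SYSCALL/SYSENTER, interrupts,
        # traps, exceptions. The target tracker goes to
        # WAIT_FOR_ENDBRANCH; the source tracker retains its state.
        self.state[target] = WAIT

    def iret(self):
        # Case 3: the target (user) tracker becomes active again and
        # keeps whatever state it had when it was interrupted.
        pass
```

If user code is interrupted at the end of an indirect CALL/JMP (user tracker in WAIT_FOR_ENDBRANCH), that state survives the interrupt and the later IRET, so a #CP fault still occurs if the next user instruction is not an ENDBRANCH.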
TRACKER = IDLE
ENDIF
than 5 loads) past a missing ENDBRANCH, while later implementations will completely block the speculative execu-
tion of instructions after a missing ENDBRANCH.
This mechanism also limits or blocks speculation of the next sequential instructions after an indirect JMP or CALL,
presuming the JMP/CALL puts the CET tracker into the WAIT_FOR_ENDBRANCH state and the next sequential
instruction is not an ENDBRANCH.
5. Updates to Chapter 1, Volume 2A
Change bars and green text show changes to Chapter 1 of the Intel® 64 and IA-32 Architectures Software Devel-
oper’s Manual, Volume 2A: Instruction Set Reference, A-L.
------------------------------------------------------------------------------------------
Changes to this chapter: Updates to Section 1.1 “Intel® 64 and IA-32 Processors Covered in this Manual”.
The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B, 2C & 2D: Instruction Set
Reference (order numbers 253666, 253667, 326018 and 334569) are part of a set that describes the architecture
and programming environment of all Intel 64 and IA-32 architecture processors. Other volumes in this set are:
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture (Order
Number 253665).
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A, 3B, 3C & 3D: System
Programming Guide (order numbers 253668, 253669, 326019 and 332831).
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4: Model-Specific Registers
(order number 335592).
The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, describes the basic architecture
and programming environment of Intel 64 and IA-32 processors. The Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volumes 2A, 2B, 2C & 2D, describe the instruction set of the processor and the opcode struc-
ture. These volumes apply to application programmers and to programmers who write operating systems or exec-
utives. The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A, 3B, 3C & 3D, describe
the operating-system support environment of Intel 64 and IA-32 processors. These volumes target operating-
system and BIOS designers. In addition, the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3B, addresses the programming environment for classes of software that host operating systems. The
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4, describes the model-specific registers
of Intel 64 and IA-32 processors.
ABOUT THIS MANUAL
The Intel® Atom™ processor C series, the Intel® Atom™ processor X series, the Intel® Pentium® processor J
series, the Intel® Celeron® processor J series, and the Intel® Celeron® processor N series are based on the Gold-
mont microarchitecture.
The Intel® Xeon Phi™ Processor 3200, 5200, 7200 Series is based on the Knights Landing microarchitecture and
supports Intel 64 architecture.
The Intel® Pentium® Silver processor series, the Intel® Celeron® processor J series, and the Intel® Celeron®
processor N series are based on the Goldmont Plus microarchitecture.
The 8th generation Intel® Core™ processors, 9th generation Intel® Core™ processors, and Intel® Xeon® E proces-
sors are based on the Coffee Lake microarchitecture and support Intel 64 architecture.
The Intel® Xeon Phi™ Processor 7215, 7285, 7295 Series is based on the Knights Mill microarchitecture and
supports Intel 64 architecture.
The 2nd generation Intel® Xeon® Processor Scalable Family is based on the Cascade Lake product and supports
Intel 64 architecture.
The 10th generation Intel® Core™ processors are based on the Ice Lake microarchitecture and support Intel 64
architecture.
IA-32 architecture is the instruction set architecture and programming environment for Intel's 32-bit microproces-
sors. Intel® 64 architecture is the instruction set architecture and programming environment which is the superset
of Intel’s 32-bit and 64-bit architectures. It is compatible with the IA-32 architecture.
1.2 OVERVIEW OF VOLUME 2A, 2B, 2C AND 2D: INSTRUCTION SET REFERENCE
A description of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B, 2C & 2D content
follows:
Chapter 1 — About This Manual. Gives an overview of all seven volumes of the Intel® 64 and IA-32 Architec-
tures Software Developer’s Manual. It also describes the notational conventions in these manuals and lists related
Intel® manuals and documentation of interest to programmers and hardware designers.
Chapter 2 — Instruction Format. Describes the machine-level instruction format used for all IA-32 instructions
and gives the allowable encodings of prefixes, the operand-identifier byte (ModR/M byte), the addressing-mode
specifier byte (SIB byte), and the displacement and immediate bytes.
Chapter 3 — Instruction Set Reference, A-L. Describes Intel 64 and IA-32 instructions in detail, including an
algorithmic description of operations, the effect on flags, the effect of operand- and address-size attributes, and
the exceptions that may be generated. The instructions are arranged in alphabetical order. General-purpose, x87
FPU, Intel MMX™ technology, SSE/SSE2/SSE3/SSSE3/SSE4 extensions, and system instructions are included.
Chapter 4 — Instruction Set Reference, M-U. Continues the description of Intel 64 and IA-32 instructions
started in Chapter 3. It starts Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B.
Chapter 5 — Instruction Set Reference, V-Z. Continues the description of Intel 64 and IA-32 instructions
started in chapters 3 and 4. It provides the balance of the alphabetized list of instructions and starts Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 2C.
Chapter 6 — Safer Mode Extensions Reference. Describes the safer mode extensions (SMX). SMX is intended
for a system executive to support launching a measured environment in a platform where the identity of the soft-
ware controlling the platform hardware can be measured for the purpose of making trust decisions. This chapter
starts Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2D.
Chapter 7— Instruction Set Reference Unique to Intel® Xeon Phi™ Processors. Describes the instruction
set that is unique to Intel® Xeon Phi™ processors based on the Knights Landing and Knights Mill microarchitec-
tures. The set is not supported in any other Intel processors.
Appendix A — Opcode Map. Gives an opcode map for the IA-32 instruction set.
Appendix B — Instruction Formats and Encodings. Gives the binary encoding of each form of each IA-32
instruction.
Appendix C — Intel® C/C++ Compiler Intrinsics and Functional Equivalents. Lists the Intel® C/C++ compiler
intrinsics and their assembly code equivalents for each of the IA-32 MMX and SSE/SSE2/SSE3 instructions.
NOTE
Avoid any software dependence upon the state of reserved bits in IA-32 registers. Software that depends upon
the values of reserved register bits becomes dependent upon the unspecified manner in which the processor
handles these bits. Programs that depend upon reserved values risk incompatibility with future processors.
where:
• A label is an identifier which is followed by a colon.
• A mnemonic is a reserved name for a class of instruction opcodes which have the same function.
• The operands argument1, argument2, and argument3 are optional. There may be from zero to three operands,
depending on the opcode. When present, they take the form of either literals or identifiers for data items.
Operand identifiers are either reserved names of registers or are assumed to be assigned to data items
declared in another part of the program (which may not be shown in the example).
When two operands are present in an arithmetic or logical instruction, the right operand is the source and the left
operand is the destination.
For example:
Segment-register:Byte-address
For example, the following segment address identifies the byte at address FF79H in the segment pointed by the DS
register:
DS:FF79H
The following segment address identifies an instruction address in the code segment. The CS register points to the
code segment and the EIP register contains the address of the instruction.
CS:EIP
1.3.6 Exceptions
An exception is an event that typically occurs when an instruction causes an error. For example, an attempt to
divide by zero generates an exception. However, some exceptions, such as breakpoints, occur under other condi-
tions. Some types of exceptions may provide error codes. An error code reports additional information about the
error. An example of the notation used to show an exception and error code is shown below:
#PF(fault code)
This example refers to a page-fault exception under conditions where an error code naming a type of fault is
reported. Under some conditions, exceptions which produce error codes may not be able to report an accurate
code. In this case, the error code is zero, as shown below for a general-protection exception:
#GP(0)
[Figure: examples of the notation — CPUID feature flag: CPUID.01H:EDX.SSE[bit 25] = 1; control register: CR4.OSFXSR[bit 9] = 1; model-specific register: IA32_MISC_ENABLE.ENABLEFOPCODE[bit 2] = 1.]
Figure 1-2. Syntax for CPUID, CR, and MSR Data Presentation
See also:
• The latest security information on Intel® products:
https://www.intel.com/content/www/us/en/security-center/default.html
• Software developer resources, guidance and insights for security advisories:
https://software.intel.com/security-software-guidance/
• The data sheet for a particular Intel 64 or IA-32 processor
• The specification update for a particular Intel 64 or IA-32 processor
• Intel® C++ Compiler documentation and online help:
http://software.intel.com/en-us/articles/intel-compilers/
• Intel® Fortran Compiler documentation and online help:
http://software.intel.com/en-us/articles/intel-compilers/
• Intel® Software Development Tools:
https://software.intel.com/en-us/intel-sdp-home
• Intel® 64 and IA-32 Architectures Software Developer’s Manual (in one, four or ten volumes):
https://software.intel.com/en-us/articles/intel-sdm
• Intel® 64 and IA-32 Architectures Optimization Reference Manual:
https://software.intel.com/en-us/articles/intel-sdm#optimization
• Intel 64 Architecture x2APIC Specification:
http://www.intel.com/content/www/us/en/architecture-and-technology/64-architecture-x2apic-specifi-
cation.html
• Intel® Trusted Execution Technology Measured Launched Environment Programming Guide:
http://www.intel.com/content/www/us/en/software-developers/intel-txt-software-development-guide.html
• Developing Multi-threaded Applications: A Platform Consistent Approach:
https://software.intel.com/sites/default/files/article/147714/51534-developing-multithreaded-applica-
tions.pdf
• Using Spin-Loops on Intel® Pentium® 4 Processor and Intel® Xeon® Processor:
https://software.intel.com/sites/default/files/22/30/25602
• Performance Monitoring Unit Sharing Guide
http://software.intel.com/file/30388
Literature related to selected features in future Intel processors are available at:
• Intel® Architecture Instruction Set Extensions Programming Reference
https://software.intel.com/en-us/isa-extensions
• Intel® Software Guard Extensions (Intel® SGX) Programming Reference
https://software.intel.com/en-us/isa-extensions/intel-sgx
More relevant links are:
• Intel® Developer Zone:
https://software.intel.com/en-us
• Developer centers:
http://www.intel.com/content/www/us/en/hardware-developers/developer-centers.html
• Processor support general link:
http://www.intel.com/support/processors/
• Intel® Hyper-Threading Technology (Intel® HT Technology):
http://www.intel.com/technology/platform-technology/hyper-threading/index.htm
6. Updates to Chapter 2, Volume 2A
Change bars and green text show changes to Chapter 2 of the Intel® 64 and IA-32 Architectures Software Devel-
oper’s Manual, Volume 2A: Instruction Set Reference, A-L.
------------------------------------------------------------------------------------------
Changes to this chapter: Minor typo corrections, and update to the Alignment Check description in all exception
tables. Update to table 2-36 “EVEX Embedded Broadcast/Rounding/SAE and Vector Length on Vector Instruc-
tions”.
This chapter describes the instruction format for all Intel 64 and IA-32 processors. The instruction format for
protected mode, real-address mode and virtual-8086 mode is described in Section 2.1. Increments provided for IA-
32e mode and its sub-modes are described in Section 2.2.
[Figure: ModR/M byte — Mod (bits 7:6), Reg/Opcode (bits 5:3), R/M (bits 2:0); SIB byte — Scale (bits 7:6), Index (bits 5:3), Base (bits 2:0).]
1. The REX prefix is optional, but if used must be immediately before the opcode; see Section
2.2.1, “REX Prefixes” for additional information.
2. For VEX encoding information, see Section 2.3, “Intel® Advanced Vector Extensions (Intel®
AVX)”.
3. Some rare instructions can take an 8B immediate or 8B displacement.
INSTRUCTION FORMAT
— BND prefix is encoded using F2H if the following conditions are true:
• CPUID.(EAX=07H, ECX=0):EBX.MPX[bit 14] is set.
• BNDCFGU.EN and/or IA32_BNDCFGS.EN is set.
• When the F2 prefix precedes a near CALL, a near RET, a near JMP, a short Jcc, or a near Jcc instruction
(see Chapter 17, “Intel® MPX,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 1).
• Group 2
— Segment override prefixes:
• 2EH—CS segment override (use with any branch instruction is reserved).
• 36H—SS segment override prefix (use with any branch instruction is reserved).
• 3EH—DS segment override prefix (use with any branch instruction is reserved).
• 26H—ES segment override prefix (use with any branch instruction is reserved).
• 64H—FS segment override prefix (use with any branch instruction is reserved).
• 65H—GS segment override prefix (use with any branch instruction is reserved).
— Branch hints1:
• 2EH—Branch not taken (used only with Jcc instructions).
• 3EH—Branch taken (used only with Jcc instructions).
• Group 3
• Operand-size override prefix is encoded using 66H (66H is also used as a mandatory prefix for some
instructions).
• Group 4
• 67H—Address-size override prefix.
The LOCK prefix (F0H) forces an operation that ensures exclusive use of shared memory in a multiprocessor envi-
ronment. See “LOCK—Assert LOCK# Signal Prefix” in Chapter 3, “Instruction Set Reference, A-L,” for a description
of this prefix.
Repeat prefixes (F2H, F3H) cause an instruction to be repeated for each element of a string. Use these prefixes
only with string and I/O instructions (MOVS, CMPS, SCAS, LODS, STOS, INS, and OUTS). Use of repeat prefixes
and/or undefined opcodes with other Intel 64 or IA-32 instructions is reserved; such use may cause unpredictable
behavior.
Some instructions may use F2H or F3H as a mandatory prefix to express distinct functionality.
Branch hint prefixes (2EH, 3EH) allow a program to give a hint to the processor about the most likely code path for
a branch. Use these prefixes only with conditional branch instructions (Jcc). Other use of branch hint prefixes
and/or other undefined opcodes with Intel 64 or IA-32 instructions is reserved; such use may cause unpredictable
behavior.
The operand-size override prefix allows a program to switch between 16- and 32-bit operand sizes. Either size can
be the default; use of the prefix selects the non-default size.
Some SSE2/SSE3/SSSE3/SSE4 instructions and instructions using a three-byte sequence of primary opcode bytes
may use 66H as a mandatory prefix to express distinct functionality.
Other use of the 66H prefix is reserved; such use may cause unpredictable behavior.
The address-size override prefix (67H) allows programs to switch between 16- and 32-bit addressing. Either size
can be the default; the prefix selects the non-default size. Using this prefix and/or other undefined opcodes when
operands for the instruction do not reside in memory is reserved; such use may cause unpredictable behavior.
1. Some earlier microarchitectures used these as branch hints, but recent generations have not and they are reserved for future hint
usage.
2.1.2 Opcodes
A primary opcode can be 1, 2, or 3 bytes in length. An additional 3-bit opcode field is sometimes encoded in the
ModR/M byte. Smaller fields can be defined within the primary opcode. Such fields define the direction of opera-
tion, size of displacements, register encoding, condition codes, or sign extension. Encoding fields used by an
opcode vary depending on the class of operation.
Two-byte opcode formats for general-purpose and SIMD instructions consist of one of the following:
• An escape opcode byte 0FH as the primary opcode and a second opcode byte.
• A mandatory prefix (66H, F2H, or F3H), an escape opcode byte, and a second opcode byte (same as previous
bullet).
For example, CVTDQ2PD consists of the following sequence: F3 0F E6. The first byte is a mandatory prefix (it is not
considered as a repeat prefix).
Three-byte opcode formats for general-purpose and SIMD instructions consist of one of the following:
• An escape opcode byte 0FH as the primary opcode, plus two additional opcode bytes.
• A mandatory prefix (66H, F2H, or F3H), an escape opcode byte, plus two additional opcode bytes (same as
previous bullet).
For example, PHADDW for XMM registers consists of the following sequence: 66 0F 38 01. The first byte is the
mandatory prefix.
Valid opcode expressions are defined in Appendix A and Appendix B.
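The two- and three-byte formats above can be recognized with a short sketch. It handles only the cases listed here (no VEX/EVEX encodings and no legacy prefixes other than the mandatory 66H/F2H/F3H), and the function name is invented for illustration:

```python
MANDATORY = {0x66, 0xF2, 0xF3}

def split_opcode(code: bytes):
    """Split an opcode byte sequence into (mandatory_prefix, opcode)."""
    prefix = None
    if code[0] in MANDATORY:
        prefix, code = code[0], code[1:]
    if code[0] != 0x0F:
        return prefix, code[:1]    # one-byte primary opcode
    if code[1] in (0x38, 0x3A):
        return prefix, code[:3]    # three-byte opcode (0F 38 / 0F 3A)
    return prefix, code[:2]        # two-byte opcode (0F xx)
```

For CVTDQ2PD (F3 0F E6) this yields the mandatory prefix F3H and opcode bytes 0F E6; for PHADDW (66 0F 38 01) it yields the mandatory prefix 66H and opcode bytes 0F 38 01.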
[Example: ModR/M byte C8H = 11001000B — Mod = 11, /digit opcode extension REG = 001, R/M = 000.]
NOTES:
1. The default segment register is SS for the effective addresses containing a BP index, DS for other effective addresses.
2. The disp16 nomenclature denotes a 16-bit displacement that follows the ModR/M byte and that is added to the index.
3. The disp8 nomenclature denotes an 8-bit displacement that follows the ModR/M byte and that is sign-extended and added to the
index.
NOTES:
1. The [--][--] nomenclature means a SIB follows the ModR/M byte.
2. The disp32 nomenclature denotes a 32-bit displacement that follows the ModR/M byte (or the SIB byte if one is present) and that is
added to the index.
3. The disp8 nomenclature denotes an 8-bit displacement that follows the ModR/M byte (or the SIB byte if one is present) and that is
sign-extended and added to the index.
Table 2-3 is organized to give 256 possible values of the SIB byte (in hexadecimal). General purpose registers used
as a base are indicated across the top of the table, along with corresponding values for the SIB byte’s base field.
Table rows in the body of the table indicate the register used as the index (SIB byte bits 3, 4 and 5) and the scaling
factor (determined by SIB byte bits 6 and 7).
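The field split described above can be sketched in Python (an informal illustration; the helper name is ours). A ModR/M byte and an SIB byte divide the same way: two bits, three bits, three bits:

```python
# Sketch: extract the 2/3/3-bit fields of a ModR/M or SIB byte.
# For ModR/M the fields are (mod, reg, r/m); for SIB they are
# (scale, index, base), with the scale factor being 2**scale.
def split_modrm(b):
    return (b >> 6) & 3, (b >> 3) & 7, b & 7

mod, reg, rm = split_modrm(0xC8)     # 11001000B
print(mod, reg, rm)                  # 3 1 0

scale, index, base = split_modrm(0x85)
print(1 << scale, index, base)       # scale factor 4, index 0, base 5
```
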
NOTES:
1. The [*] nomenclature means a disp32 with no base if the MOD is 00B. Otherwise, [*] means disp8 or disp32 + [EBP]. This provides the
following address modes:
MOD bits Effective Address
00 [scaled index] + disp32
01 [scaled index] + disp8 + [EBP]
10 [scaled index] + disp32 + [EBP]
[Figure: instruction format] Instruction Prefixes: Grp 1, Grp 2, Grp 3, Grp 4 (each optional) | Opcode: 1-, 2-, or 3-byte opcode | ModR/M: 1 byte (if required) | SIB: 1 byte (if required) | Displacement: address displacement of 1, 2, or 4 bytes, or none | Immediate: immediate data of 1, 2, or 4 bytes, or none
2.2.1.1 Encoding
Intel 64 and IA-32 instruction formats specify up to three registers by using 3-bit fields in the encoding, depending
on the format:
• ModR/M: the reg and r/m fields of the ModR/M byte.
• ModR/M with SIB: the reg field of the ModR/M byte, the base and index fields of the SIB (scale, index, base)
byte.
• Instructions without ModR/M: the reg field of the opcode.
In 64-bit mode, these formats do not change. Bits needed to define fields in the 64-bit context are provided by the
addition of REX prefixes.
• REX.R modifies the ModR/M reg field when that field encodes a GPR, SSE, control or debug register. REX.R is
ignored when ModR/M specifies other registers or defines an extended opcode.
• REX.X bit modifies the SIB index field.
• REX.B either modifies the base in the ModR/M r/m field or SIB base field; or it modifies the opcode reg field
used for accessing GPRs.
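The bullets above can be sketched as follows (an informal Python illustration, not the architecture definition; names are ours). Each REX bit supplies the fourth, most-significant bit of a 3-bit register field:

```python
# Sketch: pull the W/R/X/B bits out of a REX prefix byte (40H-4FH) and use
# REX.R to extend the 3-bit ModR/M.reg field to a 4-bit register number.
def rex_bits(rex):
    assert 0x40 <= rex <= 0x4F, "not a REX prefix"
    return {"W": (rex >> 3) & 1, "R": (rex >> 2) & 1,
            "X": (rex >> 1) & 1, "B": rex & 1}

r = rex_bits(0x44)                 # REX.R set, others clear
reg4 = (r["R"] << 3) | 0b001       # ModR/M.reg = 001 extends to register 9
print(r, reg4)
```
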
Figure 2-4. Memory Addressing Without an SIB Byte; REX.X Not Used
Figure 2-5. Register-Register Addressing (No Memory Operand); REX.X Not Used
Figure 2-7. Register Operand Coded in Opcode Byte; REX.X & REX.R Not Used
In the IA-32 architecture, byte registers (AH, AL, BH, BL, CH, CL, DH, and DL) are encoded in the ModR/M byte’s
reg field, the r/m field or the opcode reg field as registers 0 through 7. REX prefixes provide an additional
addressing capability for byte-registers that makes the least-significant byte of GPRs available for byte operations.
Certain combinations of the fields of the ModR/M byte and the SIB byte have special meaning for register encod-
ings. For some combinations, fields expanded by the REX prefix are not decoded. Table 2-5 describes how each
case behaves.
2.2.1.3 Displacement
Addressing in 64-bit mode uses existing 32-bit ModR/M and SIB encodings. The ModR/M and SIB displacement
sizes do not change. They remain 8 bits or 32 bits and are sign-extended to 64 bits.
2.2.1.5 Immediates
In 64-bit mode, the typical size of immediate operands remains 32 bits. When the operand size is 64 bits, the
processor sign-extends all immediates to 64 bits prior to their use.
Support for 64-bit immediate operands is accomplished by expanding the semantics of the existing move (MOV
reg, imm16/32) instructions. These instructions (opcodes B8H – BFH) move 16 bits or 32 bits of immediate data
(depending on the effective operand size) into a GPR. When the effective operand size is 64 bits, these instructions
can be used to load an immediate into a GPR. A REX prefix is needed to override the 32-bit default operand size to
a 64-bit operand size.
For example:
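A sketch of the mechanism just described (illustrative Python; the helper name is ours): opcode B8H selects RAX, and a REX.W prefix (48H) promotes the operand size so the instruction carries a full 8-byte immediate.

```python
# Sketch: build the encoding of MOV RAX, imm64 as REX.W (48H) + opcode B8H
# followed by the 8-byte immediate stored little-endian.
import struct

def mov_rax_imm64(imm):
    return bytes([0x48, 0xB8]) + struct.pack("<Q", imm)

print(mov_rax_imm64(0x1122334455667788).hex())
# -> 48b88877665544332211  (immediate bytes appear least-significant first)
```
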
The ModR/M encoding for RIP-relative addressing does not depend on using a prefix. Specifically, the r/m bit field
encoding of 101B (used to select RIP-relative addressing) is not affected by the REX prefix. For example, selecting
R13 (REX.B = 1, r/m = 101B) with mod = 00B still results in RIP-relative addressing. The 4-bit field formed by
REX.B and ModR/M.r/m is not fully decoded in this case. In order to address R13 with no displacement, software
must encode R13 + 0 using a 1-byte displacement of zero.
RIP-relative addressing is enabled by 64-bit mode, not by a 64-bit address-size. The use of the address-size prefix
does not disable RIP-relative addressing. The effect of the address-size prefix is to truncate and zero-extend the
computed effective address to 32 bits.
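The effective-address behavior described in this paragraph can be sketched as follows (an informal Python illustration; names are ours):

```python
# Sketch: a RIP-relative operand (mod = 00B, r/m = 101B) adds a signed disp32
# to the address of the *next* instruction; a 67H address-size prefix does not
# disable this, it truncates and zero-extends the result to 32 bits.
def rip_relative(next_rip, disp32, addr_size_override=False):
    ea = (next_rip + disp32) & 0xFFFFFFFFFFFFFFFF
    if addr_size_override:
        ea &= 0xFFFFFFFF
    return ea

print(hex(rip_relative(0x140001007, -0x10)))        # 0x140000ff7
print(hex(rip_relative(0x140001007, -0x10, True)))  # 0x40000ff7
```
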
[Figure: VEX bit fields] L, Vector Length: 0 = scalar or 128-bit vector; 1 = 256-bit vector.
The following subsections describe the various fields in the two- and three-byte VEX prefixes.
2.3.5.6 2-byte VEX Byte 1, bits [6:3] and 3-byte VEX Byte 2, bits [6:3] - ‘vvvv’ the Source or Dest
Register Specifier
In 32-bit mode the VEX first bytes C4H and C5H alias onto the LES and LDS instructions. To maintain compatibility
with existing programs, the VEX 2nd byte, bits [7:6], must be 11b. To achieve this, the VEX payload bits are
selected to place only inverted, 64-bit valid fields (extended register selectors) in these upper bits.
The 2-byte VEX Byte 1, bits [6:3] and the 3-byte VEX, Byte 2, bits [6:3] encode a field (shorthand VEX.vvvv) that
for instructions with 2 or more source registers and an XMM or YMM or memory destination encodes the first source
register specifier stored in inverted (1’s complement) form.
VEX.vvvv is not used by instructions with one source (except certain shifts; see below) or by instructions with
no XMM, YMM, or memory destination. If an instruction does not use VEX.vvvv, then the field should be set to
1111b; otherwise the instruction will #UD.
In 64-bit mode all 4 bits may be used. See Table 2-8 for the encoding of the XMM or YMM registers. In 32-bit and
16-bit modes, bit 6 must be 1 (if bit 6 is not 1, the 2-byte VEX version will be decoded as an LDS instruction and
the 3-byte VEX version will ignore this bit).
The VEX.vvvv field is encoded in bit inverted format for accessing a register operand.
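The inverted (1's complement) storage of VEX.vvvv can be sketched as follows (informal Python; names are ours):

```python
# Sketch: VEX.vvvv stores the register number in 1's-complement form in a
# 4-bit field; an unused field must hold 1111b (i.e., "register 0" inverted).
def encode_vvvv(reg):        # reg = XMM/YMM register number, 0-15
    return (~reg) & 0xF

def decode_vvvv(vvvv):
    return (~vvvv) & 0xF

print(encode_vvvv(2))        # register 2 encodes as 1101b = 13
print(decode_vvvv(0b1110))   # field 1110b decodes to register 1
```
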
VEX.m-mmmm is only available on the 3-byte VEX. The 2-byte VEX implies a leading 0FH opcode byte.
2.3.6.2 2-byte VEX byte 1, bit[2], and 3-byte VEX byte 2, bit [2]- “L”
The vector length field, VEX.L, is encoded in bit[2] of either the second byte of 2-byte VEX, or the third byte of 3-
byte VEX. If “VEX.L = 1”, it indicates 256-bit vector operation. “VEX.L = 0” indicates scalar and 128-bit vector
operations.
The instruction VZEROUPPER is a special case that is encoded with VEX.L = 0, although its operation zeroes bits
255:128 of all YMM registers accessible in the current operating mode.
See the following table.
2.3.6.3 2-byte VEX byte 1, bits[1:0], and 3-byte VEX byte 2, bits [1:0]- “pp”
Up to one implied prefix is encoded by bits [1:0] of either the 2-byte VEX byte 1 or the 3-byte VEX byte 2. The prefix
behaves as if it were encoded prior to VEX, but after all other encoded prefixes.
See the following table.
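Putting the preceding subsections together, field extraction from a 3-byte VEX prefix (C4H) can be sketched as follows (informal Python; names are ours). Byte 1 holds inverted R/X/B and m-mmmm; byte 2 holds W, inverted vvvv, L, and pp:

```python
# Sketch: decode the two payload bytes that follow the C4H byte of a
# 3-byte VEX prefix.
PP = {0b00: None, 0b01: 0x66, 0b10: 0xF3, 0b11: 0xF2}

def decode_vex3(b1, b2):
    return {
        "R": 1 - (b1 >> 7), "X": 1 - ((b1 >> 6) & 1), "B": 1 - ((b1 >> 5) & 1),
        "mmmmm": b1 & 0x1F,           # 1: 0F, 2: 0F 38, 3: 0F 3A
        "W": b2 >> 7,
        "vvvv": (~(b2 >> 3)) & 0xF,   # stored inverted
        "L": (b2 >> 2) & 1,           # 0: scalar/128-bit, 1: 256-bit
        "implied_prefix": PP[b2 & 3],
    }

# VPHADDW xmm1, xmm2, xmm3 (VEX.128.66.0F38 01 /r) with vvvv = XMM2:
print(decode_vex3(0xE2, 0x69))
```
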
if XSAVE and XRSTOR are used with a save/restore mask that does not set bits corresponding to all supported
extensions to the vector registers.)
In Intel AVX2, a vector SIB (VSIB) byte supports memory addressing to an array of linear addresses. VSIB addressing
is only supported in a subset of Intel AVX2 instructions. VSIB memory addressing requires a 32-bit or 64-bit
effective address. In 32-bit mode, VSIB addressing is not supported when the address size attribute is overridden
to 16 bits. In 16-bit protected mode, VSIB memory addressing is permitted if the address size attribute is
overridden to 32 bits. Additionally, VSIB memory addressing is supported only with the VEX prefix.
In VSIB memory addressing, the SIB byte consists of:
• The scale field (bits 7:6) specifies the scale factor.
• The index field (bits 5:3) specifies the register number of the vector index register; each element in the vector
register specifies an index.
• The base field (bits 2:0) specifies the register number of the base register.
Table 2-13 shows the 32-bit VSIB addressing form. It is organized to give 256 possible values of the SIB byte (in
hexadecimal). General purpose registers used as a base are indicated across the top of the table, along with corresponding
values for the SIB byte’s base field. The register names also include R8L-R15L, applicable only in 64-bit
mode (when the address size override prefix is used; the value of VEX.B is not shown in Table 2-13). In 32-bit mode,
R8L-R15L do not apply.
Table rows in the body of the table indicate the vector index register used as the index field, with each supported
scaling factor shown separately. Vector registers used in the index field can be XMM or YMM registers. The left-most
column includes vector registers VR8-VR15 (i.e., XMM8/YMM8-XMM15/YMM15), which are only available in
64-bit mode and do not apply when encoding in 32-bit mode.
Table 2-13. 32-Bit VSIB Addressing Forms of the SIB Byte (Contd.)
VR0/VR8 *4 10 000 80 81 82 83 84 85 86 87
VR1/VR9 001 88 89 8A 8B 8C 8D 8E 8F
VR2/VR10 010 90 91 92 93 94 95 96 97
VR3/VR11 011 98 99 9A 9B 9C 9D 9E 9F
VR4/VR12 100 A0 A1 A2 A3 A4 A5 A6 A7
VR5/VR13 101 A8 A9 AA AB AC AD AE AF
VR6/VR14 110 B0 B1 B2 B3 B4 B5 B6 B7
VR7/VR15 111 B8 B9 BA BB BC BD BE BF
VR0/VR8 *8 11 000 C0 C1 C2 C3 C4 C5 C6 C7
VR1/VR9 001 C8 C9 CA CB CC CD CE CF
VR2/VR10 010 D0 D1 D2 D3 D4 D5 D6 D7
VR3/VR11 011 D8 D9 DA DB DC DD DE DF
VR4/VR12 100 E0 E1 E2 E3 E4 E5 E6 E7
VR5/VR13 101 E8 E9 EA EB EC ED EE EF
VR6/VR14 110 F0 F1 F2 F3 F4 F5 F6 F7
VR7/VR15 111 F8 F9 FA FB FC FD FE FF
NOTES:
1. If ModR/M.mod = 00b, the base address is zero and the effective address is computed as [scaled vector index] + disp32. Otherwise,
the base address is computed as [EBP/R13] + disp, where the displacement is either 8 or 32 bits depending on the value of ModR/M.mod:
MOD Effective Address
00b [Scaled Vector Register] + Disp32
01b [Scaled Vector Register] + Disp8 + [EBP/R13]
10b [Scaled Vector Register] + Disp32 + [EBP/R13]
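The per-element address generation described above can be sketched as follows (an informal Python illustration, not the architecture definition; names are ours):

```python
# Sketch: VSIB effective-address generation. Each element i of the vector
# index register contributes its own address:
#   address[i] = base + disp + (index[i] << scale)
def vsib_addresses(base, disp, scale, vindex):
    return [(base + disp + (idx << scale)) & 0xFFFFFFFF for idx in vindex]

# Scale field = 2 (factor 4), gathering dword elements at indices 0, 3, 7, 9:
print([hex(a) for a in vsib_addresses(0x1000, 0x20, 2, [0, 3, 7, 9])])
```
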
NOTE
Instructions that operate only with MMX, X87, or general-purpose registers are not covered by the
exception classes defined in this section. For instructions that operate on MMX registers, see
Section 22.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers”
in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
(*) - Additional exception restrictions are present - see the Instruction description for details
(**) - Instruction behavior on alignment check reporting with mask bits of less than all 1s is the same as with mask bits of all 1s, i.e. no
alignment checks are performed.
(***) - PCMPESTRI, PCMPESTRM, PCMPISTRI, PCMPISTRM and LDDQU instructions do not cause #GP if the memory operand is not
aligned to 16-Byte boundary.
Table 2-15 classifies exception behaviors for AVX instructions. Within each class of exception conditions listed in
Table 2-18 through Table 2-27, certain subsets of AVX instructions may be subject to a #UD exception depending
on the encoded value of the VEX.L field. Table 2-17 provides supplemental information about AVX instructions
that may be subject to a #UD exception if encoded with incorrect values in the VEX.W or VEX.L field.
Type 3
VMASKMOVDQU, VMPSADBW, VPABSB/W/D, VPCMP(E/I)STRI/M,
VPACKSSWB/DW, VPACKUSWB/DW, VPADDB/W/D, PHMINPOSUW
VPADDQ, VPADDSB/W, VPADDUSB/W, VPALIGNR, VPAND,
VPANDN, VPAVGB/W, VPBLENDVB, VPBLENDW,
VPCMP(E/I)STRI/M, VPCMPEQB/W/D/Q, VPCMPGTB/W/D/Q,
VPHADDW/D, VPHADDSW, VPHMINPOSUW, VPHSUBD/W,
VPHSUBSW, VPMADDWD, VPMADDUBSW, VPMAXSB/W/D,
Type 4
VPMAXUB/W/D, VPMINSB/W/D, VPMINUB/W/D,
VPMULHUW, VPMULHRSW, VPMULHW/LW, VPMULLD,
VPMULUDQ, VPMULDQ, VPOR, VPSADBW, VPSHUFB/D,
VPSHUFHW/LW, VPSIGNB/W/D, VPSLLW/D/Q, VPSRAW/D,
VPSRLW/D/Q, VPSUBB/W/D/Q, VPSUBSB/W,
VPUNPCKHBW/WD/DQ, VPUNPCKHQDQ,
VPUNPCKLBW/WD/DQ, VPUNPCKLQDQ, VPXOR
VEXTRACTPS, VINSERTPS, VMOVD, VMOVQ, VMOVLPD, Same as column 3
VMOVLPS, VMOVHPD, VMOVHPS, VPEXTRB, VPEXTRD,
Type 5
VPEXTRW, VPEXTRQ, VPINSRB, VPINSRD, VPINSRW,
VPINSRQ, VPMOVSX/ZX, VLDMXCSR, VSTMXCSR
VEXTRACTF128,
VPERM2F128,
Type 6 VBROADCASTSD,
VBROADCASTF128,
VINSERTF128,
VMOVLHPS, VMOVHLPS, VPMOVMSKB, VPSLLDQ, VMOVLHPS, VMOVHLPS
Type 7 VPSRLDQ, VPSLLW, VPSLLD, VPSLLQ, VPSRAW, VPSRAD,
VPSRLW, VPSRLD, VPSRLQ
Type 8
Type 11
Type 12
(Applicable modes for each condition are given in parentheses: Real, Virtual-8086, Protected/Compatibility, 64-bit.)
Invalid Opcode, #UD
• VEX prefix (Real, Virtual-8086).
• VEX prefix with XCR0[2:1] ≠ ‘11b’ or CR4.OSXSAVE[bit 18] = 0 (Protected/Compatibility, 64-bit).
• Legacy SSE instruction with CR0.EM[bit 2] = 1 or CR4.OSFXSR[bit 9] = 0 (all modes).
• If preceded by a LOCK prefix (F0H) (all modes).
• If any REX, F2, F3, or 66 prefixes precede a VEX prefix (Protected/Compatibility, 64-bit).
• If any corresponding CPUID feature flag is ‘0’ (all modes).
Device Not Available, #NM
• If CR0.TS[bit 3] = 1 (all modes).
Stack, #SS(0)
• For an illegal address in the SS segment (Protected/Compatibility).
• If a memory address referencing the SS segment is in a non-canonical form (64-bit).
General Protection, #GP(0)
• VEX.256: Memory operand is not 32-byte aligned (Protected/Compatibility, 64-bit).
• VEX.128: Memory operand is not 16-byte aligned (Protected/Compatibility, 64-bit).
• Legacy SSE: Memory operand is not 16-byte aligned (all modes).
• For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments (Protected/Compatibility).
• If the memory address is in a non-canonical form (64-bit).
• If any part of the operand lies outside the effective address space from 0 to FFFFH (Real, Virtual-8086).
Page Fault, #PF(fault-code)
• For a page fault (Virtual-8086, Protected/Compatibility, 64-bit).
(Applicable modes for each condition are given in parentheses: Real, Virtual-8086, Protected/Compatibility, 64-bit.)
Invalid Opcode, #UD
• VEX prefix (Real, Virtual-8086).
• If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 0 (all modes).
• VEX prefix with XCR0[2:1] ≠ ‘11b’ or CR4.OSXSAVE[bit 18] = 0 (Protected/Compatibility, 64-bit).
• Legacy SSE instruction with CR0.EM[bit 2] = 1 or CR4.OSFXSR[bit 9] = 0 (all modes).
• If preceded by a LOCK prefix (F0H) (all modes).
• If any REX, F2, F3, or 66 prefixes precede a VEX prefix (Protected/Compatibility, 64-bit).
• If any corresponding CPUID feature flag is ‘0’ (all modes).
Device Not Available, #NM
• If CR0.TS[bit 3] = 1 (all modes).
Stack, #SS(0)
• For an illegal address in the SS segment (Protected/Compatibility).
• If a memory address referencing the SS segment is in a non-canonical form (64-bit).
General Protection, #GP(0)
• Legacy SSE: Memory operand is not 16-byte aligned (all modes).
• For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments (Protected/Compatibility).
• If the memory address is in a non-canonical form (64-bit).
• If any part of the operand lies outside the effective address space from 0 to FFFFH (Real, Virtual-8086).
Page Fault, #PF(fault-code)
• For a page fault (Virtual-8086, Protected/Compatibility, 64-bit).
SIMD Floating-point Exception, #XM
• If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 1 (all modes).
(Applicable modes for each condition are given in parentheses: Real, Virtual-8086, Protected/Compatibility, 64-bit.)
Invalid Opcode, #UD
• VEX prefix (Real, Virtual-8086).
• If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 0 (all modes).
• VEX prefix with XCR0[2:1] ≠ ‘11b’ or CR4.OSXSAVE[bit 18] = 0 (Protected/Compatibility, 64-bit).
• Legacy SSE instruction with CR0.EM[bit 2] = 1 or CR4.OSFXSR[bit 9] = 0 (all modes).
• If preceded by a LOCK prefix (F0H) (all modes).
• If any REX, F2, F3, or 66 prefixes precede a VEX prefix (Protected/Compatibility, 64-bit).
• If any corresponding CPUID feature flag is ‘0’ (all modes).
Device Not Available, #NM
• If CR0.TS[bit 3] = 1 (all modes).
Stack, #SS(0)
• For an illegal address in the SS segment (Protected/Compatibility).
• If a memory address referencing the SS segment is in a non-canonical form (64-bit).
General Protection, #GP(0)
• For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments (Protected/Compatibility).
• If the memory address is in a non-canonical form (64-bit).
• If any part of the operand lies outside the effective address space from 0 to FFFFH (Real, Virtual-8086).
Page Fault, #PF(fault-code)
• For a page fault (Virtual-8086, Protected/Compatibility, 64-bit).
Alignment Check, #AC(0)
• For a 2, 4, or 8 byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3 (Virtual-8086, Protected/Compatibility, 64-bit).
SIMD Floating-point Exception, #XM
• If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 1 (all modes).
2.4.4 Exceptions Type 4 (>=16 Byte mem arg no alignment, no floating-point exceptions)
(Applicable modes for each condition are given in parentheses: Real, Virtual-8086, Protected/Compatibility, 64-bit.)
Invalid Opcode, #UD
• VEX prefix (Real, Virtual-8086).
• VEX prefix with XCR0[2:1] ≠ ‘11b’ or CR4.OSXSAVE[bit 18] = 0 (Protected/Compatibility, 64-bit).
• Legacy SSE instruction with CR0.EM[bit 2] = 1 or CR4.OSFXSR[bit 9] = 0 (all modes).
• If preceded by a LOCK prefix (F0H) (all modes).
• If any REX, F2, F3, or 66 prefixes precede a VEX prefix (Protected/Compatibility, 64-bit).
• If any corresponding CPUID feature flag is ‘0’ (all modes).
Device Not Available, #NM
• If CR0.TS[bit 3] = 1 (all modes).
Stack, #SS(0)
• For an illegal address in the SS segment (Protected/Compatibility).
• If a memory address referencing the SS segment is in a non-canonical form (64-bit).
General Protection, #GP(0)
• Legacy SSE: Memory operand is not 16-byte aligned (all modes; see note 1).
• For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments (Protected/Compatibility).
• If the memory address is in a non-canonical form (64-bit).
• If any part of the operand lies outside the effective address space from 0 to FFFFH (Real, Virtual-8086).
Page Fault, #PF(fault-code)
• For a page fault (Virtual-8086, Protected/Compatibility, 64-bit).
NOTES:
1. LDDQU, MOVUPD, MOVUPS, PCMPESTRI, PCMPESTRM, PCMPISTRI, and PCMPISTRM instructions do not cause #GP if the memory
operand is not aligned to 16-Byte boundary.
(Applicable modes for each condition are given in parentheses: Real, Virtual-8086, Protected/Compatibility, 64-bit.)
Invalid Opcode, #UD
• VEX prefix (Real, Virtual-8086).
• VEX prefix with XCR0[2:1] ≠ ‘11b’ or CR4.OSXSAVE[bit 18] = 0 (Protected/Compatibility, 64-bit).
• Legacy SSE instruction with CR0.EM[bit 2] = 1 or CR4.OSFXSR[bit 9] = 0 (all modes).
• If preceded by a LOCK prefix (F0H) (all modes).
• If any REX, F2, F3, or 66 prefixes precede a VEX prefix (Protected/Compatibility, 64-bit).
• If any corresponding CPUID feature flag is ‘0’ (all modes).
Device Not Available, #NM
• If CR0.TS[bit 3] = 1 (all modes).
Stack, #SS(0)
• For an illegal address in the SS segment (Protected/Compatibility).
• If a memory address referencing the SS segment is in a non-canonical form (64-bit).
General Protection, #GP(0)
• For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments (Protected/Compatibility).
• If the memory address is in a non-canonical form (64-bit).
• If any part of the operand lies outside the effective address space from 0 to FFFFH (Real, Virtual-8086).
Page Fault, #PF(fault-code)
• For a page fault (Virtual-8086, Protected/Compatibility, 64-bit).
Alignment Check, #AC(0)
• For a 2, 4, or 8 byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3 (Virtual-8086, Protected/Compatibility, 64-bit).
(Applicable modes for each condition are given in parentheses: Real, Virtual-8086, Protected/Compatibility, 64-bit.)
Invalid Opcode, #UD
• VEX prefix (Real, Virtual-8086).
• VEX prefix with XCR0[2:1] ≠ ‘11b’ or CR4.OSXSAVE[bit 18] = 0 (Protected/Compatibility, 64-bit).
• If preceded by a LOCK prefix (F0H) (Protected/Compatibility, 64-bit).
• If any REX, F2, F3, or 66 prefixes precede a VEX prefix (Protected/Compatibility, 64-bit).
• If any corresponding CPUID feature flag is ‘0’ (Protected/Compatibility, 64-bit).
Device Not Available, #NM
• If CR0.TS[bit 3] = 1 (Protected/Compatibility, 64-bit).
Stack, #SS(0)
• For an illegal address in the SS segment (Protected/Compatibility).
• If a memory address referencing the SS segment is in a non-canonical form (64-bit).
General Protection, #GP(0)
• For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments (Protected/Compatibility).
• If the memory address is in a non-canonical form (64-bit).
Page Fault, #PF(fault-code)
• For a page fault (Protected/Compatibility, 64-bit).
Alignment Check, #AC(0)
• For a 2, 4, or 8 byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3 (Protected/Compatibility, 64-bit).
(Applicable modes for each condition are given in parentheses: Real, Virtual-8086, Protected/Compatibility, 64-bit.)
Invalid Opcode, #UD
• VEX prefix (Real, Virtual-8086).
• VEX prefix with XCR0[2:1] ≠ ‘11b’ or CR4.OSXSAVE[bit 18] = 0 (Protected/Compatibility, 64-bit).
• Legacy SSE instruction with CR0.EM[bit 2] = 1 or CR4.OSFXSR[bit 9] = 0 (all modes).
• If preceded by a LOCK prefix (F0H) (all modes).
• If any REX, F2, F3, or 66 prefixes precede a VEX prefix (Protected/Compatibility, 64-bit).
• If any corresponding CPUID feature flag is ‘0’ (all modes).
Device Not Available, #NM
• If CR0.TS[bit 3] = 1 (Protected/Compatibility, 64-bit).
2.4.10 Exceptions Type 12 (VEX-only, VSIB mem arg, no AC, no floating-point exceptions)
Any VEX-encoded GPR instruction with a 66H, F2H, or F3H prefix preceding VEX will #UD.
Any VEX-encoded GPR instruction with a REX prefix preceding VEX will #UD.
VEX-encoded GPR instructions are not supported in real and virtual 8086 modes.
(*) - Additional exception restrictions are present - see the Instruction description for details.
The significant feature differences between EVEX and VEX are summarized below.
• EVEX is a 4-Byte prefix (the first byte must be 62H); VEX is either a 2-Byte (C5H is the first byte) or 3-Byte
(C4H is the first byte) prefix.
• EVEX prefix can encode 32 vector registers (XMM/YMM/ZMM) in 64-bit mode.
• EVEX prefix can encode an opmask register for conditional processing or selection control in EVEX-encoded
vector instructions. Opmask instructions, whose source/destination operands are opmask registers and treat
the content of an opmask register as a single value, are encoded using the VEX prefix.
• EVEX memory addressing with disp8 form uses a compressed disp8 encoding scheme to improve the encoding
density of the instruction byte stream.
• EVEX prefix can encode functionality that is specific to instruction classes (e.g., a packed instruction with
“load+op” semantic can support embedded broadcast functionality, a floating-point instruction with rounding
semantic can support static rounding functionality, and a floating-point instruction with non-rounding arithmetic
semantic can support “suppress all exceptions” functionality).
[Figure: AVX-512 instruction format with the EVEX prefix]
[Prefixes] | EVEX (4 bytes) | Opcode (1 byte) | ModR/M (1 byte) | [SIB] (1 byte) | [Disp16, Disp32] (2 or 4 bytes) or [Disp8*N] (1 byte) | [Immediate] (1 byte)
The EVEX prefix is a 4-byte prefix, with the first two bytes derived from unused encoding form of the 32-bit-mode-
only BOUND instruction. The layout of the EVEX prefix is shown in Figure 2-11. The first byte must be 62H, followed
by three payload bytes, denoted as P0, P1, and P2 individually or collectively as P[23:0] (see Figure 2-11).
EVEX 62H P0 P1 P2
7 6 5 4 3 2 1 0
P0 R X B R’ 0 0 m m P[7:0]
7 6 5 4 3 2 1 0
P1 W v v v v 1 p p P[15:8]
7 6 5 4 3 2 1 0
P2 z L’ L b V’ a a a P[23:16]
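The payload layout shown above can be sketched as follows (an informal Python illustration; names are ours). The inverted fields are un-inverted on extraction:

```python
# Sketch: decode the three EVEX payload bytes P0, P1, P2 that follow 62H.
# Raises if the reserved bits P[3:2] are nonzero or the fixed bit P[10] is 0.
def decode_evex(p0, p1, p2):
    assert (p0 & 0b00001100) == 0 and ((p1 >> 2) & 1) == 1, "#UD"
    return {
        "R": 1 - (p0 >> 7), "X": 1 - ((p0 >> 6) & 1), "B": 1 - ((p0 >> 5) & 1),
        "R2": 1 - ((p0 >> 4) & 1),   # EVEX.R'
        "mm": p0 & 3,                # compressed escape: 1 = 0F, 2 = 0F 38, 3 = 0F 3A
        "W": p1 >> 7,
        "vvvv": (~(p1 >> 3)) & 0xF,  # stored inverted
        "pp": p1 & 3,                # compressed legacy prefix
        "z": p2 >> 7,                # merging (0) vs. zeroing (1)
        "LL": (p2 >> 5) & 3,         # EVEX.L'L
        "b": (p2 >> 4) & 1,          # broadcast/RC/SAE context
        "V2": 1 - ((p2 >> 3) & 1),   # EVEX.V'
        "aaa": p2 & 7,               # opmask register k0-k7
    }

print(decode_evex(0xF1, 0x7D, 0x48))  # e.g. payload of a 512-bit 66.0F form
```
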
The bit fields in P[23:0] are divided into the following functional groups (Table 2-30 provides a tabular summary):
• Reserved bits: P[3:2] must be 0, otherwise #UD.
• Fixed-value bit: P[10] must be 1, otherwise #UD.
• Compressed legacy prefix/escape bytes: P[1:0] is identical to the lowest 2 bits of VEX.mmmmm; P[9:8] is
identical to VEX.pp.
• Operand specifier modifier bits for vector register, general purpose register, memory addressing: P[7:5] allows
access to the next set of 8 registers beyond the low 8 registers when combined with ModR/M register specifiers.
• Operand specifier modifier bit for vector register: P[4] (or EVEX.R’) allows access to the high 16 vector register
set when combined with P[7] and ModR/M.reg specifier; P[6] can also provide access to a high 16 vector
register when SIB or VSIB addressing are not needed.
• Non-destructive source/vector index operand specifier: P[19] and P[14:11] encode the second source vector
register operand in a non-destructive source syntax; a vector index register operand can access an upper 16
vector register using P[19].
• Op-mask register specifiers: P[18:16] encodes op-mask register set k0-k7 in instructions operating on vector
registers.
• EVEX.W: P[15] is similar to VEX.W which serves either as opcode extension bit or operand size promotion to
64-bit in 64-bit mode.
• Vector destination merging/zeroing: P[23] encodes the destination result behavior, which either zeroes the
masked elements or leaves the masked elements unchanged.
• Broadcast/Static-rounding/SAE context bit: P[20] encodes multiple functions, which differ across classes of
instructions and can affect the meaning of the remaining field (EVEX.L’L). The behavior for the following
instruction classes is:
— Broadcasting a single element across the destination vector register: this applies to the instruction class
with Load+Op semantic where one of the source operands is from memory.
— Redirect the L’L field (P[22:21]) as static rounding control for floating-point instructions with rounding
semantic. Static rounding control overrides the MXCSR.RC field and implies “suppress all exceptions” (SAE).
— Enable SAE for floating-point instructions with arithmetic semantic that is not rounding.
— For instruction classes outside of the aforementioned three classes, setting EVEX.b will cause #UD.
• Vector length/rounding control specifier: P[22:21] can serve one of three options.
— Vector length information for packed vector instructions.
— Ignored for instructions operating on vector register content as a single data element.
— Rounding control for floating-point instructions that have a rounding semantic and whose source and
destination operands are all vector registers.
Table 2-31. 32-Register Support in 64-bit Mode Using EVEX with Embedded REX Bits
Operand | Bit 4 (see note 1) | Bit 3 | Bits [2:0] | Reg. Type | Common Usages
REG EVEX.R’ REX.R modrm.reg GPR, Vector Destination or Source
VVVV EVEX.V’ EVEX.vvvv GPR, Vector 2ndSource or Destination
RM EVEX.X EVEX.B modrm.r/m GPR, Vector 1st Source or Destination
BASE 0 EVEX.B modrm.r/m GPR memory addressing
INDEX 0 EVEX.X sib.index GPR memory addressing
VIDX EVEX.V’ EVEX.X sib.index Vector VSIB memory addressing
NOTES:
1. Not applicable for accessing general purpose registers.
The mapping of register operands used by various instruction syntaxes and memory addressing in 32-bit modes is
shown in Table 2-32.
• An opmask register serving as the destination or source operand of a vector instruction is encoded using the
standard ModR/M byte’s reg and r/m fields.
numerical precision from a wider format to narrower format). EVEX.b is supported for such instructions for data
element sizes which are either dword or qword (see Section 2.6.11).
EVEX-encoded instructions that are pure load/store, and “load+op” instructions that operate on data element
sizes less than dword, do not support broadcasting using EVEX.b. These are listed in Table 2-35. Table 2-35 also
includes many broadcast instructions which perform broadcast using a subset of data elements without using
EVEX.b, as well as a few data element size conversion instructions. Instructions classified in Table 2-35 do not
use EVEX.b; EVEX.b must be 0, otherwise #UD will occur.
The tupletype will be referenced in the instruction operand encoding table in the reference page of each instruction,
providing the cross reference from the scaling factor N to the encoded memory addressing operand.
Note that the disp8*N rules still apply when using 16-bit addressing.
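The compressed disp8*N scheme can be sketched as follows (informal Python; names are ours): the stored 8-bit displacement is sign-extended, then multiplied by the tupletype scaling factor N before use.

```python
# Sketch: expand a compressed EVEX disp8 into the actual byte displacement.
def disp8xN(disp8, N):
    signed = disp8 - 256 if disp8 >= 128 else disp8   # sign-extend 8 bits
    return signed * N

print(disp8xN(0x01, 64))   # Full Mem tuple at VL=512: byte offset 64
print(disp8xN(0xFF, 64))   # -1 * 64 = byte offset -64
```
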
Table 2-34 fragment (EVEX DISP8*N for instructions affected by embedded broadcast); columns as in Table 2-35, with an added EVEX.b column:
Full: EVEX.b = 1, InputSize 64bit, EVEX.W = 1, {1tox}: N = 8, 8, 8
Half (Load+Op, Half Vector): EVEX.b = 0, InputSize 32bit, EVEX.W = 0, none: N = 8, 16, 32
Half (Load+Op, Half Vector): EVEX.b = 1, InputSize 32bit, EVEX.W = 0, {1tox}: N = 4, 4, 4
Table 2-35. EVEX DISP8*N for Instructions Not Affected by Embedded Broadcast
TupleType InputSize EVEX.W N (VL= 128) N (VL= 256) N (VL= 512) Comment
Full Mem N/A N/A 16 32 64 Load/store or subDword full vector
Tuple1 Scalar 8bit N/A 1 1 1 1 Tuple
Tuple1 Scalar 16bit N/A 2 2 2
Tuple1 Scalar 32bit 0 4 4 4
Tuple1 Scalar 64bit 1 8 8 8
Tuple1 Fixed 32bit N/A 4 4 4 1 Tuple, memsize not affected by EVEX.W
Tuple1 Fixed 64bit N/A 8 8 8
Tuple2 32bit 0 8 8 8 Broadcast (2 elements)
Tuple2 64bit 1 NA 16 16
Tuple4 32bit 0 NA 16 16 Broadcast (4 elements)
Tuple4 64bit 1 NA NA 32
Tuple8 32bit 0 NA NA 32 Broadcast (8 elements)
Half Mem N/A N/A 8 16 32 SubQword Conversion
Quarter Mem N/A N/A 4 8 16 SubDword Conversion
Eighth Mem N/A N/A 2 4 8 SubWord Conversion
Mem128 N/A N/A 16 16 16 Shift count from memory
MOVDDUP N/A N/A 8 32 64 VMOVDDUP
Table 2-36. EVEX Embedded Broadcast/Rounding/SAE and Vector Length on Vector Instructions
Field positions: EVEX.b = P2[4]; EVEX.L’L = P2[6:5]; EVEX.RC = P2[6:5].
• Reg-reg, FP instructions with rounding semantic: EVEX.b enables static rounding control (SAE implied); vector length is implied (512 bit or scalar); EVEX.RC: 00b = SAE + RNE, 01b = SAE + RD, 10b = SAE + RU, 11b = SAE + RZ.
• FP instructions without rounding semantic, can cause #XM: EVEX.b is SAE control; EVEX.L’L: 00b = 128-bit, 01b = 256-bit, 10b = 512-bit, 11b = reserved (#UD); EVEX.RC: NA.
• Load+op instructions with memory source: EVEX.b is broadcast control; EVEX.L’L as above; EVEX.RC: NA.
• Other instructions (explicit load/store/broadcast/gather/scatter): EVEX.b must be 0 (otherwise #UD); EVEX.L’L as above; EVEX.RC: NA.
Table 2-40 lists the #UD conditions of instruction encodings of the opmask register using EVEX.aaa and EVEX.z.
Table 2-41 lists the #UD conditions of EVEX bit fields that depend on the context of EVEX.b.
(Applicable modes for each condition are given in parentheses: Real, Virtual-8086, Protected/Compatibility, 64-bit.)
General Protection, #GP(0)
• If fault suppression not set, and an illegal memory operand effective address in the CS, DS, ES, FS or GS segments (Protected/Compatibility).
• If fault suppression not set, and the memory address is in a non-canonical form (64-bit).
• If fault suppression not set, and any part of the operand lies outside the effective address space from 0 to FFFFH (Real, Virtual-8086).
Page Fault, #PF(fault-code)
• If fault suppression not set, and a page fault (Virtual-8086, Protected/Compatibility, 64-bit).
EVEX-encoded instructions that have memory alignment restrictions but do not support memory fault suppression
follow exception class E1NF.
EVEX-encoded scalar instructions with arithmetic semantic that do not support memory fault suppression follow
exception class E3NF.
(Applicable modes for each condition are given in parentheses: Real, Virtual-8086, Protected/Compatibility, 64-bit.)
Invalid Opcode, #UD
• EVEX prefix (Real, Virtual-8086).
• If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 0 (all modes).
• If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies (Protected/Compatibility, 64-bit):
  • State requirement, Table 2-37 not met.
  • Opcode independent #UD condition in Table 2-38.
  • Operand encoding #UD conditions in Table 2-39.
  • Opmask encoding #UD condition of Table 2-40.
  • If EVEX.b != 0.
• If preceded by a LOCK prefix (F0H) (all modes).
• If any REX, F2, F3, or 66 prefixes precede an EVEX prefix (Protected/Compatibility, 64-bit).
• If any corresponding CPUID feature flag is ‘0’ (all modes).
Device Not Available, #NM
• If CR0.TS[bit 3] = 1 (all modes).
Stack, #SS(0)
• For an illegal address in the SS segment (Protected/Compatibility).
• If a memory address referencing the SS segment is in a non-canonical form (64-bit).
General Protection, #GP(0)
• For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments (Protected/Compatibility).
• If the memory address is in a non-canonical form (64-bit).
• If any part of the operand lies outside the effective address space from 0 to FFFFH (Real, Virtual-8086).
Page Fault, #PF(fault-code)
• For a page fault (Virtual-8086, Protected/Compatibility, 64-bit).
Alignment Check, #AC(0)
• For a 2, 4, or 8 byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3 (Virtual-8086, Protected/Compatibility, 64-bit).
SIMD Floating-point Exception, #XM
• If an unmasked SIMD floating-point exception, {sae} or {er} not set, and CR4.OSXMMEXCPT[bit 10] = 1 (all modes).
EVEX-encoded vector instructions that neither cause a SIMD floating-point exception nor support memory fault
suppression follow exception class E4NF.
EVEX-encoded scalar/partial-vector instructions that neither cause a SIMD floating-point exception nor support
memory fault suppression follow exception class E5NF.
EVEX-encoded instructions that do not cause SIMD FP exception nor support memory fault suppression follow
exception class E6NF.
EVEX-encoded vector or partial-vector instructions that must be encoded with VEX.L’L = 0, do not cause SIMD FP
exception nor support memory fault suppression follow exception class E9NF.
EVEX-encoded scalar instructions that must be encoded with VEX.L’L = 0, do not cause SIMD FP exception nor
support memory fault suppression follow exception class E10NF.
2.7.10 Exception Type E11 (EVEX-only, mem arg no AC, floating-point exceptions)
EVEX-encoded instructions that can cause SIMD FP exception, memory operand support fault suppression but do
not cause #AC follow exception class E11.
2.7.11 Exception Type E12 and E12NP (VSIB mem arg, no AC, no floating-point exceptions)
EVEX-encoded prefetch instructions that do not cause #PF follow exception class E12NP.
Table 2-63. TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg)
Exception conditions of Opmask instructions that address memory are listed as Type K21.
Table 2-64. TYPE K21 Exception Definition (VEX-Encoded OpMask Instructions Addressing Memory)
7. Updates to Chapter 3, Volume 2A
Change bars and green text show changes to Chapter 3 of the Intel® 64 and IA-32 Architectures Software Devel-
oper’s Manual, Volume 2A: Instruction Set Reference, A-L.
------------------------------------------------------------------------------------------
Updates to the following instructions: AESDEC, AESDECLAST, AESENC, AESENCLAST, CLFLUSH, CLFLUSHOPT,
CLWB, CMOVcc, CMPPS, CPUID, FLD, GF2P8AFFINEINVQB, GF2P8AFFINEQB, GF2P8MULB, INC, and INVPCID.
Note that CPUID has several updates for new instructions, CET, and leaf 1AH.
Added Control-flow Enforcement Technology details to the following instructions: CALL, INT n/INTO/INT3/INT1,
IRET/IRETD/IRETQ, and JMP.
Description
This instruction performs a single round of the AES decryption flow using the Equivalent Inverse Cipher, with the
round key from the second source operand, operating on 128-bit data (state) from the first source operand, and
stores the result in the destination operand.
Use the AESDEC instruction for all but the last decryption round. For the last decryption round, use the AESDE-
CLAST instruction.
VEX and EVEX encoded versions of the instruction allow 3-operand (non-destructive) operation. The legacy
encoded versions of the instruction require that the first source operand and the destination operand are the same
and must be an XMM register.
The EVEX encoded form of this instruction does not support memory fault suppression.
Operation
AESDEC
STATE ← SRC1;
RoundKey ← SRC2;
STATE ← InvShiftRows( STATE );
STATE ← InvSubBytes( STATE );
STATE ← InvMixColumns( STATE );
DEST[127:0] ← STATE XOR RoundKey;
DEST[MAXVL-1:128] (Unmodified)
Other Exceptions
See Exceptions Type 4.
EVEX-encoded: See Exceptions Type E4NF.
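The round structure shown in the Operation section can be modeled in a few lines. The sketch below implements only the InvShiftRows/ShiftRows byte rotations and the final XOR with the round key, holding the 4x4 state row-major; InvSubBytes and InvMixColumns are omitted, so this is a partial illustration of the flow, not a working AES implementation.

```python
# Minimal model of two steps from the AESDEC/AESENC round flow.
def shift_rows(state):
    # Forward ShiftRows (AESENC): rotate row r left by r bytes.
    return [row[r:] + row[:r] for r, row in enumerate(state)]

def inv_shift_rows(state):
    # InvShiftRows (AESDEC): rotate row r right by r bytes.
    return [row[-r:] + row[:-r] if r else row[:] for r, row in enumerate(state)]

def xor_round_key(state, round_key):
    # Final step of each round: STATE XOR RoundKey.
    return [[a ^ b for a, b in zip(sr, kr)] for sr, kr in zip(state, round_key)]
```

A useful sanity check is that `inv_shift_rows` exactly undoes `shift_rows`, mirroring the relationship between the encryption and decryption flows.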
Description
This instruction performs the last round of the AES decryption flow using the Equivalent Inverse Cipher, with the
round key from the second source operand, operating on 128-bit data (state) from the first source operand, and
stores the result in the destination operand.
VEX and EVEX encoded versions of the instruction allow 3-operand (non-destructive) operation. The legacy
encoded versions of the instruction require that the first source operand and the destination operand are the same
and must be an XMM register.
The EVEX encoded form of this instruction does not support memory fault suppression.
Operation
AESDECLAST
STATE ← SRC1;
RoundKey ← SRC2;
STATE ← InvShiftRows( STATE );
STATE ← InvSubBytes( STATE );
DEST[127:0] ← STATE XOR RoundKey;
DEST[MAXVL-1:128] (Unmodified)
Other Exceptions
See Exceptions Type 4.
EVEX-encoded: See Exceptions Type E4NF.
Description
This instruction performs a single round of an AES encryption flow using a round key from the second source
operand, operating on 128-bit data (state) from the first source operand, and stores the result in the destination
operand.
Use the AESENC instruction for all but the last encryption round. For the last encryption round, use the
AESENCLAST instruction.
VEX and EVEX encoded versions of the instruction allow 3-operand (non-destructive) operation. The legacy
encoded versions of the instruction require that the first source operand and the destination operand are the same
and must be an XMM register.
The EVEX encoded form of this instruction does not support memory fault suppression.
Operation
AESENC
STATE ← SRC1;
RoundKey ← SRC2;
STATE ← ShiftRows( STATE );
STATE ← SubBytes( STATE );
STATE ← MixColumns( STATE );
DEST[127:0] ← STATE XOR RoundKey;
DEST[MAXVL-1:128] (Unmodified)
Other Exceptions
See Exceptions Type 4.
EVEX-encoded: See Exceptions Type E4NF.
Description
This instruction performs the last round of an AES encryption flow using a round key from the second source
operand, operating on 128-bit data (state) from the first source operand, and stores the result in the destination
operand.
VEX and EVEX encoded versions of the instruction allow 3-operand (non-destructive) operation. The legacy
encoded versions of the instruction require that the first source operand and the destination operand are the same
and must be an XMM register.
The EVEX encoded form of this instruction does not support memory fault suppression.
Operation
AESENCLAST
STATE ← SRC1;
RoundKey ← SRC2;
STATE ← ShiftRows( STATE );
STATE ← SubBytes( STATE );
DEST[127:0] ← STATE XOR RoundKey;
DEST[MAXVL-1:128] (Unmodified)
Other Exceptions
See Exceptions Type 4.
EVEX-encoded: See Exceptions Type E4NF.
CALL—Call Procedure
Opcode       Instruction    Op/En  64-bit Mode  Compat/Leg Mode  Description
E8 cw        CALL rel16     D      N.S.         Valid            Call near, relative, displacement relative to next instruction.
E8 cd        CALL rel32     D      Valid        Valid            Call near, relative, displacement relative to next instruction. 32-bit displacement sign extended to 64-bits in 64-bit mode.
FF /2        CALL r/m16     M      N.E.         Valid            Call near, absolute indirect, address given in r/m16.
FF /2        CALL r/m32     M      N.E.         Valid            Call near, absolute indirect, address given in r/m32.
FF /2        CALL r/m64     M      Valid        N.E.             Call near, absolute indirect, address given in r/m64.
9A cd        CALL ptr16:16  D      Invalid      Valid            Call far, absolute, address given in operand.
9A cp        CALL ptr16:32  D      Invalid      Valid            Call far, absolute, address given in operand.
FF /3        CALL m16:16    M      Valid        Valid            Call far, absolute indirect, address given in m16:16. In 32-bit mode: if selector points to a gate, then RIP = 32-bit zero extended displacement taken from gate; else RIP = zero extended 16-bit offset from far pointer referenced in the instruction.
FF /3        CALL m16:32    M      Valid        Valid            In 64-bit mode: if selector points to a gate, then RIP = 64-bit displacement taken from gate; else RIP = zero extended 32-bit offset from far pointer referenced in the instruction.
REX.W FF /3  CALL m16:64    M      Valid        N.E.             In 64-bit mode: if selector points to a gate, then RIP = 64-bit displacement taken from gate; else RIP = 64-bit offset from far pointer referenced in the instruction.
Description
Saves procedure linking information on the stack and branches to the called procedure specified using the target
operand. The target operand specifies the address of the first instruction in the called procedure. The operand can
be an immediate value, a general-purpose register, or a memory location.
This instruction can be used to execute four types of calls:
• Near Call — A call to a procedure in the current code segment (the segment currently pointed to by the CS
register), sometimes referred to as an intra-segment call.
• Far Call — A call to a procedure located in a different segment than the current code segment, sometimes
referred to as an inter-segment call.
• Inter-privilege-level far call — A far call to a procedure in a segment at a different privilege level than that
of the currently executing program or procedure.
• Task switch — A call to a procedure located in a different task.
The latter two call types (inter-privilege-level call and task switch) can only be executed in protected mode. See
“Calling Procedures Using Call and RET” in Chapter 6 of the Intel® 64 and IA-32 Architectures Software Devel-
oper’s Manual, Volume 1, for additional information on near, far, and inter-privilege-level calls. See Chapter 7,
“Task Management,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for infor-
mation on performing task switches with the CALL instruction.
Near Call. When executing a near call, the processor pushes the value of the EIP register (which contains the offset
of the instruction following the CALL instruction) on the stack (for use later as a return-instruction pointer). The
processor then branches to the address in the current code segment specified by the target operand. The target
operand specifies either an absolute offset in the code segment (an offset from the base of the code segment) or a
relative offset (a signed displacement relative to the current value of the instruction pointer in the EIP register; this
value points to the instruction following the CALL instruction). The CS register is not changed on near calls.
For a near call absolute, an absolute offset is specified indirectly in a general-purpose register or a memory location
(r/m16, r/m32, or r/m64). The operand-size attribute determines the size of the target operand (16, 32 or 64
bits). When in 64-bit mode, the operand size for near call (and all near branches) is forced to 64-bits. Absolute
offsets are loaded directly into the EIP(RIP) register. If the operand size attribute is 16, the upper two bytes of the
EIP register are cleared, resulting in a maximum instruction pointer size of 16 bits. When accessing an absolute
offset indirectly using the stack pointer [ESP] as the base register, the base value used is the value of the ESP
before the instruction executes.
A relative offset (rel16 or rel32) is generally specified as a label in assembly code. But at the machine code level, it
is encoded as a signed, 16- or 32-bit immediate value. This value is added to the value in the EIP(RIP) register. In
64-bit mode the relative offset is always a 32-bit immediate value which is sign extended to 64-bits before it is
added to the value in the RIP register for the target calculation. As with absolute offsets, the operand-size attribute
determines the size of the target operand (16, 32, or 64 bits). In 64-bit mode the target operand will always be 64-
bits because the operand size is forced to 64-bits for near branches.
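As a concrete sketch of the target calculation described above, the following models a near relative call: the rel32 immediate is sign-extended and added to the instruction pointer (which already points at the instruction following the CALL), with the 16-bit operand-size case masking the result to 16 bits (applicable outside 64-bit mode). Function names are illustrative.

```python
# Sketch: target of a near relative CALL.
MASK64 = (1 << 64) - 1

def sign_extend32(imm32):
    """Interpret a 32-bit immediate as a signed value."""
    return (imm32 ^ 0x80000000) - 0x80000000

def near_call_target(rip, rel32, operand_size=64):
    target = (rip + sign_extend32(rel32)) & MASK64
    if operand_size == 16:
        # 16-bit operand size: result is masked to 16 bits (tempEIP AND 0000FFFFH)
        target &= 0xFFFF
    return target
```

A rel32 of FFFFFFFBH is -5, so a CALL at the point where RIP = 1000H transfers to FFBH.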
Far Calls in Real-Address or Virtual-8086 Mode. When executing a far call in real-address or virtual-8086 mode, the
processor pushes the current value of both the CS and EIP registers on the stack for use as a return-instruction
pointer. The processor then performs a “far branch” to the code segment and offset specified with the target
operand for the called procedure. The target operand specifies an absolute far address either directly with a pointer
(ptr16:16 or ptr16:32) or indirectly with a memory location (m16:16 or m16:32). With the pointer method, the
segment and offset of the called procedure is encoded in the instruction using a 4-byte (16-bit operand size) or 6-
byte (32-bit operand size) far address immediate. With the indirect method, the target operand specifies a memory
location that contains a 4-byte (16-bit operand size) or 6-byte (32-bit operand size) far address. The operand-size
attribute determines the size of the offset (16 or 32 bits) in the far address. The far address is loaded directly into
the CS and EIP registers. If the operand-size attribute is 16, the upper two bytes of the EIP register are cleared.
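The far-address immediate layout described above (the offset encoded first, little-endian, followed by the 16-bit segment selector) can be decoded as follows; the helper is a sketch for illustration only.

```python
import struct

# Sketch: decoding the far-address immediate of a direct far CALL (opcode 9A).
# ptr16:16 is a 4-byte far address; ptr16:32 is a 6-byte far address.
def parse_far_pointer(imm, operand_size=32):
    if operand_size == 32:
        offset, segment = struct.unpack('<IH', imm)   # 4-byte offset, then selector
    else:
        offset, segment = struct.unpack('<HH', imm)   # 2-byte offset, then selector
    return segment, offset
```

So the 6-byte immediate 78 56 34 12 CD AB decodes to segment ABCDH, offset 12345678H.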
Far Calls in Protected Mode. When the processor is operating in protected mode, the CALL instruction can be used to
perform the following types of far calls:
• Far call to the same privilege level
• Far call to a different privilege level (inter-privilege level call)
• Task switch (far call to another task)
In protected mode, the processor always uses the segment selector part of the far address to access the corre-
sponding descriptor in the GDT or LDT. The descriptor type (code segment, call gate, task gate, or TSS) and access
rights determine the type of call operation to be performed.
If the selected descriptor is for a code segment, a far call to a code segment at the same privilege level is
performed. (If the selected code segment is at a different privilege level and the code segment is non-conforming,
a general-protection exception is generated.) A far call to the same privilege level in protected mode is very similar
to one carried out in real-address or virtual-8086 mode. The target operand specifies an absolute far address either
directly with a pointer (ptr16:16 or ptr16:32) or indirectly with a memory location (m16:16 or m16:32). The
operand-size attribute determines the size of the offset (16 or 32 bits) in the far address. The new code segment
selector and its descriptor are loaded into CS register; the offset from the instruction is loaded into the EIP register.
A call gate (described in the next paragraph) can also be used to perform a far call to a code segment at the same
privilege level. Using this mechanism provides an extra level of indirection and is the preferred method of making
calls between 16-bit and 32-bit code segments.
When executing an inter-privilege-level far call, the code segment for the procedure being called must be accessed
through a call gate. The segment selector specified by the target operand identifies the call gate. The target
operand can specify the call gate segment selector either directly with a pointer (ptr16:16 or ptr16:32) or indirectly
with a memory location (m16:16 or m16:32). The processor obtains the segment selector for the new code
segment and the new instruction pointer (offset) from the call gate descriptor. (The offset from the target operand
is ignored when a call gate is used.)
On inter-privilege-level calls, the processor switches to the stack for the privilege level of the called procedure. The
segment selector for the new stack segment is specified in the TSS for the currently running task. The branch to the
new code segment occurs after the stack switch. (Note that when using a call gate to perform a far call to a
segment at the same privilege level, no stack switch occurs.) On the new stack, the processor pushes the segment
selector and stack pointer for the calling procedure’s stack, an optional set of parameters from the calling
procedure’s stack, and the segment selector and instruction pointer for the calling procedure’s code segment. (A value in
the call gate descriptor determines how many parameters to copy to the new stack.) Finally, the processor
branches to the address of the procedure being called within the new code segment.
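The push sequence on the new stack can be traced with a small model. This is an illustrative sketch of the ordering only; it ignores stack width, the zero-padding of selectors, and the privilege checks.

```python
# Sketch: what an inter-privilege-level far call through a 32-bit call gate
# pushes on the NEW stack. Index 0 is pushed first (highest address).
def build_new_stack(old_ss, old_esp, old_cs, return_eip, params):
    new_stack = []
    new_stack.append(old_ss)       # caller's stack segment selector
    new_stack.append(old_esp)      # caller's stack pointer
    new_stack.extend(params)       # parameters copied per the gate's count field
    new_stack.append(old_cs)       # caller's code segment selector
    new_stack.append(return_eip)   # return instruction pointer
    return new_stack
```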
Executing a task switch with the CALL instruction is similar to executing a call through a call gate. The target
operand specifies the segment selector of the task gate for the new task activated by the switch (the offset in the
target operand is ignored). The task gate in turn points to the TSS for the new task, which contains the segment
selectors for the task’s code and stack segments. Note that the TSS also contains the EIP value for the next instruc-
tion that was to be executed before the calling task was suspended. This instruction pointer value is loaded into the
EIP register to re-start the calling task.
The CALL instruction can also specify the segment selector of the TSS directly, which eliminates the indirection of
the task gate. See Chapter 7, “Task Management,” in the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 3A, for information on the mechanics of a task switch.
When you execute a task switch with a CALL instruction, the nested task flag (NT) is set in the EFLAGS register and
the new TSS’s previous task link field is loaded with the old task’s TSS selector. Code is expected to suspend this
nested task by executing an IRET instruction which, because the NT flag is set, automatically uses the previous task
link to return to the calling task. (See “Task Linking” in Chapter 7 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A, for information on nested tasks.) Switching tasks with the CALL instruction differs
in this regard from the JMP instruction. JMP does not set the NT flag and therefore does not expect an IRET instruction
to suspend the task.
Mixing 16-Bit and 32-Bit Calls. When making far calls between 16-bit and 32-bit code segments, use a call gate. If
the far call is from a 32-bit code segment to a 16-bit code segment, the call should be made from the first 64
KBytes of the 32-bit code segment. This is because the operand-size attribute of the instruction is set to 16, so only
a 16-bit return address offset can be saved. Also, the call should be made using a 16-bit call gate so that 16-bit
values can be pushed on the stack. See Chapter 21, “Mixing 16-Bit and 32-Bit Code,” in the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 3B, for more information.
Far Calls in Compatibility Mode. When the processor is operating in compatibility mode, the CALL instruction can be
used to perform the following types of far calls:
• Far call to the same privilege level, remaining in compatibility mode
• Far call to the same privilege level, transitioning to 64-bit mode
• Far call to a different privilege level (inter-privilege level call), transitioning to 64-bit mode
Note that a CALL instruction cannot be used to cause a task switch in compatibility mode since task switches are
not supported in IA-32e mode.
In compatibility mode, the processor always uses the segment selector part of the far address to access the corre-
sponding descriptor in the GDT or LDT. The descriptor type (code segment, call gate) and access rights determine
the type of call operation to be performed.
If the selected descriptor is for a code segment, a far call to a code segment at the same privilege level is
performed. (If the selected code segment is at a different privilege level and the code segment is non-conforming,
a general-protection exception is generated.) A far call to the same privilege level in compatibility mode is very
similar to one carried out in protected mode. The target operand specifies an absolute far address either directly
with a pointer (ptr16:16 or ptr16:32) or indirectly with a memory location (m16:16 or m16:32). The operand-size
attribute determines the size of the offset (16 or 32 bits) in the far address. The new code segment selector and its
descriptor are loaded into CS register and the offset from the instruction is loaded into the EIP register. The differ-
ence is that 64-bit mode may be entered. This is specified by the L bit in the new code segment descriptor.
Note that a 64-bit call gate (described in the next paragraph) can also be used to perform a far call to a code
segment at the same privilege level. However, using this mechanism requires that the target code segment
descriptor have the L bit set, causing an entry to 64-bit mode.
When executing an inter-privilege-level far call, the code segment for the procedure being called must be accessed
through a 64-bit call gate. The segment selector specified by the target operand identifies the call gate. The target
operand can specify the call gate segment selector either directly with a pointer (ptr16:16 or ptr16:32) or indirectly
with a memory location (m16:16 or m16:32). The processor obtains the segment selector for the new code
segment and the new instruction pointer (offset) from the 16-byte call gate descriptor. (The offset from the target
operand is ignored when a call gate is used.)
On inter-privilege-level calls, the processor switches to the stack for the privilege level of the called procedure. The
segment selector for the new stack segment is set to NULL. The new stack pointer is specified in the TSS for the
currently running task. The branch to the new code segment occurs after the stack switch. (Note that when using
a call gate to perform a far call to a segment at the same privilege level, an implicit stack switch occurs as a result
of entering 64-bit mode. The SS selector is unchanged, but stack segment accesses use a segment base of 0x0,
the limit is ignored, and the default stack size is 64-bits. The full value of RSP is used for the offset, of which the
upper 32-bits are undefined.) On the new stack, the processor pushes the segment selector and stack pointer for
the calling procedure’s stack and the segment selector and instruction pointer for the calling procedure’s code
segment. (Parameter copy is not supported in IA-32e mode.) Finally, the processor branches to the address of the
procedure being called within the new code segment.
Far Calls in 64-bit Mode. When the processor is operating in 64-bit mode, the CALL instruction can be used to
perform the following types of far calls:
• Far call to the same privilege level, transitioning to compatibility mode
• Far call to the same privilege level, remaining in 64-bit mode
• Far call to a different privilege level (inter-privilege level call), remaining in 64-bit mode
Note that the CALL instruction cannot be used to cause a task switch in 64-bit mode since task switches are not
supported in IA-32e mode.
In 64-bit mode, the processor always uses the segment selector part of the far address to access the corresponding
descriptor in the GDT or LDT. The descriptor type (code segment, call gate) and access rights determine the type
of call operation to be performed.
If the selected descriptor is for a code segment, a far call to a code segment at the same privilege level is
performed. (If the selected code segment is at a different privilege level and the code segment is non-conforming,
a general-protection exception is generated.) A far call to the same privilege level in 64-bit mode is very similar to
one carried out in compatibility mode. The target operand specifies an absolute far address indirectly with a
memory location (m16:16, m16:32 or m16:64). The form of CALL with a direct specification of absolute far
address is not defined in 64-bit mode. The operand-size attribute determines the size of the offset (16, 32, or 64
bits) in the far address. The new code segment selector and its descriptor are loaded into the CS register; the offset
from the instruction is loaded into the EIP register. The new code segment may specify entry either into compati-
bility or 64-bit mode, based on the L bit value.
A 64-bit call gate (described in the next paragraph) can also be used to perform a far call to a code segment at the
same privilege level. However, using this mechanism requires that the target code segment descriptor have the L
bit set.
When executing an inter-privilege-level far call, the code segment for the procedure being called must be accessed
through a 64-bit call gate. The segment selector specified by the target operand identifies the call gate. The target
operand can only specify the call gate segment selector indirectly with a memory location (m16:16, m16:32 or
m16:64). The processor obtains the segment selector for the new code segment and the new instruction pointer
(offset) from the 16-byte call gate descriptor. (The offset from the target operand is ignored when a call gate is
used.)
On inter-privilege-level calls, the processor switches to the stack for the privilege level of the called procedure. The
segment selector for the new stack segment is set to NULL. The new stack pointer is specified in the TSS for the
currently running task. The branch to the new code segment occurs after the stack switch.
Note that when using a call gate to perform a far call to a segment at the same privilege level, an implicit stack
switch occurs as a result of entering 64-bit mode. The SS selector is unchanged, but stack segment accesses use
a segment base of 0x0, the limit is ignored, and the default stack size is 64-bits. (The full value of RSP is used for
the offset.) On the new stack, the processor pushes the segment selector and stack pointer for the calling proce-
dure’s stack and the segment selector and instruction pointer for the calling procedure’s code segment. (Parameter
copy is not supported in IA-32e mode.) Finally, the processor branches to the address of the procedure being called
within the new code segment.
Refer to Chapter 6, “Procedure Calls, Interrupts, and Exceptions” and Chapter 18, “Control-Flow Enforcement Tech-
nology (CET)” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 for CET details.
Instruction ordering. Instructions following a far call may be fetched from memory before earlier instructions
complete execution, but they will not execute (even speculatively) until all instructions prior to the far call have
completed execution (the later instructions may execute before data stored by the earlier instructions have become
globally visible).
Certain situations may lead to the next sequential instruction after a near indirect CALL being speculatively
executed. If software needs to prevent this (e.g., in order to prevent a speculative execution side channel), then an
INT3 or LFENCE instruction opcode can be placed after the near indirect CALL in order to block speculative execu-
tion.
Operation
IF near call
THEN IF near relative call
THEN
IF OperandSize = 64
THEN
tempDEST ← SignExtend(DEST); (* DEST is rel32 *)
tempRIP ← RIP + tempDEST;
IF stack not large enough for an 8-byte return address
THEN #SS(0); FI;
Push(RIP);
IF ShadowStackEnabled(CPL) AND DEST != 0
ShadowStackPush8B(RIP);
FI;
RIP ← tempRIP;
FI;
IF OperandSize = 32
THEN
tempEIP ← EIP + DEST; (* DEST is rel32 *)
IF tempEIP is not within code segment limit THEN #GP(0); FI;
IF stack not large enough for a 4-byte return address
THEN #SS(0); FI;
Push(EIP);
IF ShadowStackEnabled(CPL) AND DEST != 0
ShadowStackPush4B(EIP);
FI;
EIP ← tempEIP;
FI;
IF OperandSize = 16
THEN
tempEIP ← (EIP + DEST) AND 0000FFFFH; (* DEST is rel16 *)
IF tempEIP is not within code segment limit THEN #GP(0); FI;
IF stack not large enough for a 2-byte return address
THEN #SS(0); FI;
Push(IP);
IF ShadowStackEnabled(CPL) AND DEST != 0
(* IP is zero extended and pushed as a 32 bit value on shadow stack *)
ShadowStackPush4B(IP);
FI;
EIP ← tempEIP;
FI;
ELSE (* Near absolute call *)
IF OperandSize = 64
THEN
tempRIP ← DEST; (* DEST is r/m64 *)
IF stack not large enough for an 8-byte return address
THEN #SS(0); FI;
Push(RIP);
IF ShadowStackEnabled(CPL)
ShadowStackPush8B(RIP);
FI;
RIP ← tempRIP;
FI;
IF OperandSize = 32
THEN
tempEIP ← DEST; (* DEST is r/m32 *)
IF tempEIP is not within code segment limit THEN #GP(0); FI;
IF stack not large enough for a 4-byte return address
THEN #SS(0); FI;
Push(EIP);
IF ShadowStackEnabled(CPL)
ShadowStackPush4B(EIP);
FI;
EIP ← tempEIP;
FI;
IF OperandSize = 16
THEN
tempEIP ← DEST AND 0000FFFFH; (* DEST is r/m16 *)
IF tempEIP is not within code segment limit THEN #GP(0); FI;
IF stack not large enough for a 2-byte return address
THEN #SS(0); FI;
Push(IP);
IF ShadowStackEnabled(CPL)
(* IP is zero extended and pushed as a 32 bit value on shadow stack *)
ShadowStackPush4B(IP);
FI;
EIP ← tempEIP;
FI;
FI; (* rel/abs *)
IF (Call near indirect, absolute indirect)
IF EndbranchEnabledAndNotSuppressed(CPL)
IF CPL = 3
THEN
IF ( no 3EH prefix OR IA32_U_CET.NO_TRACK_EN == 0 )
THEN
IA32_U_CET.TRACKER = WAIT_FOR_ENDBRANCH
FI;
ELSE
IF ( no 3EH prefix OR IA32_S_CET.NO_TRACK_EN == 0 )
THEN
IA32_S_CET.TRACKER = WAIT_FOR_ENDBRANCH
FI;
FI;
FI;
FI;
FI; (* near *)
IF far call and (PE = 0 or (PE = 1 and VM = 1)) (* Real-address or virtual-8086 mode *)
THEN
IF OperandSize = 32
THEN
IF stack not large enough for a 6-byte return address
THEN #SS(0); FI;
IF DEST[31:16] is not zero THEN #GP(0); FI;
Push(CS); (* Padded with 16 high-order bits *)
Push(EIP);
CS ← DEST[47:32]; (* DEST is ptr16:32 or [m16:32] *)
EIP ← DEST[31:0]; (* DEST is ptr16:32 or [m16:32] *)
ELSE (* OperandSize = 16 *)
IF stack not large enough for a 4-byte return address
THEN #SS(0); FI;
Push(CS);
Push(IP);
CS ← DEST[31:16]; (* DEST is ptr16:16 or [m16:16] *)
EIP ← DEST[15:0]; (* DEST is ptr16:16 or [m16:16]; clear upper 16 bits *)
FI;
FI;
IF far call and (PE = 1 and VM = 0) (* Protected mode or IA-32e Mode, not virtual-8086 mode*)
THEN
IF segment selector in target operand NULL
THEN #GP(0); FI;
IF segment selector index not within descriptor table limits
THEN #GP(new code segment selector); FI;
Read type and access rights of selected segment descriptor;
IF IA32_EFER.LMA = 0
THEN
IF segment type is not a conforming or nonconforming code segment, call
gate, task gate, or TSS
THEN #GP(segment selector); FI;
ELSE
IF segment type is not a conforming or nonconforming code segment or
64-bit call gate,
THEN #GP(segment selector); FI;
FI;
Depending on type and access rights:
GO TO CONFORMING-CODE-SEGMENT;
GO TO NONCONFORMING-CODE-SEGMENT;
GO TO CALL-GATE;
GO TO TASK-GATE;
GO TO TASK-STATE-SEGMENT;
FI;
CONFORMING-CODE-SEGMENT:
IF L bit = 1 and D bit = 1 and IA32_EFER.LMA = 1
THEN #GP(new code segment selector); FI;
IF DPL > CPL
THEN #GP(new code segment selector); FI;
IF segment not present
THEN #NP(new code segment selector); FI;
NONCONFORMING-CODE-SEGMENT:
IF L-Bit = 1 and D-BIT = 1 and IA32_EFER.LMA = 1
THEN #GP(new code segment selector); FI;
IF (RPL > CPL) or (DPL ≠ CPL)
THEN #GP(new code segment selector); FI;
IF segment not present
THEN #NP(new code segment selector); FI;
IF stack not large enough for return address
THEN #SS(0); FI;
tempEIP ← DEST(Offset);
IF target mode = Compatibility mode
THEN tempEIP ← tempEIP AND 00000000_FFFFFFFFH; FI;
IF OperandSize = 16
THEN tempEIP ← tempEIP AND 0000FFFFH; FI; (* Clear upper 16 bits *)
IF (EFER.LMA = 0 or target mode = Compatibility mode) and (tempEIP outside new code
segment limit)
THEN #GP(0); FI;
IF tempEIP is non-canonical
THEN #GP(0); FI;
IF ShadowStackEnabled(CPL)
IF OperandSize = 32
THEN
tempPushLIP = CSBASE + EIP;
ELSE
IF OperandSize = 16
THEN
tempPushLIP = CSBASE + IP;
ELSE (* OperandSize = 64 *)
tempPushLIP = RIP;
FI;
FI;
tempPushCS = CS;
FI;
IF OperandSize = 32
THEN
Push(CS); (* Padded with 16 high-order bits *)
Push(EIP);
CS ← DEST(CodeSegmentSelector);
(* Segment descriptor information also loaded *)
CS(RPL) ← CPL;
EIP ← tempEIP;
ELSE
IF OperandSize = 16
THEN
Push(CS);
Push(IP);
CS ← DEST(CodeSegmentSelector);
(* Segment descriptor information also loaded *)
CS(RPL) ← CPL;
EIP ← tempEIP;
ELSE (* OperandSize = 64 *)
Push(CS); (* Padded with 48 high-order bits *)
Push(RIP);
CS ← DEST(CodeSegmentSelector);
(* Segment descriptor information also loaded *)
CS(RPL) ← CPL;
RIP ← tempEIP;
FI;
FI;
IF ShadowStackEnabled(CPL)
IF (EFER.LMA and DEST(CodeSegmentSelector).L) = 0
(* If target is legacy or compatibility mode then the SSP must be in low 4GB *)
IF (SSP & 0xFFFFFFFF00000000 != 0)
THEN #GP(0); FI;
FI;
(* align to 8 byte boundary if not already aligned *)
tempSSP = SSP;
Shadow_stack_store 4 bytes of 0 to (SSP – 4)
SSP = SSP & 0xFFFFFFFFFFFFFFF8
ShadowStackPush8B(tempPushCS); (* Padded with 48 high-order 0 bits *)
ShadowStackPush8B(tempPushLIP); (* Padded with 32 high-order bits of 0 for 32-bit LIP *)
ShadowStackPush8B(tempSSP);
FI;
IF EndbranchEnabled(CPL)
IF CPL = 3
THEN
IA32_U_CET.TRACKER = WAIT_FOR_ENDBRANCH
IA32_U_CET.SUPPRESS = 0
ELSE
IA32_S_CET.TRACKER = WAIT_FOR_ENDBRANCH
IA32_S_CET.SUPPRESS = 0
FI;
FI;
END;
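The 8-byte SSP alignment used in the shadow-stack push sequence above can be modeled in C. This is a minimal sketch with hypothetical names (`shadow_stack_t`, `align_ssp_8`); the 4-byte zero store that hardware performs at SSP − 4 is noted but not modeled.

```c
#include <stdint.h>

/* Hypothetical model of the shadow-stack alignment step: the old SSP
 * is saved for the push, then SSP is rounded down to an 8-byte
 * boundary.  (Hardware also stores 4 bytes of 0 at SSP - 4 so the
 * skipped slot cannot hold a stale value; that store is outside this
 * sketch.) */
typedef struct {
    uint64_t ssp;   /* shadow-stack pointer; grows toward lower addresses */
} shadow_stack_t;

static uint64_t align_ssp_8(shadow_stack_t *ss, uint64_t *old_ssp_out) {
    *old_ssp_out = ss->ssp;        /* tempSSP = SSP */
    ss->ssp &= ~(uint64_t)7;       /* SSP = SSP & 0xFFFFFFFFFFFFFFF8 */
    return ss->ssp;
}
```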
CALL-GATE:
IF call-gate DPL < CPL or RPL > call-gate DPL
THEN #GP(call-gate selector); FI;
IF call gate not present
THEN #NP(call-gate selector); FI;
IF call-gate code-segment selector is NULL
THEN #GP(0); FI;
IF call-gate code-segment selector index is outside descriptor table limits
THEN #GP(call-gate code-segment selector); FI;
Read call-gate code-segment descriptor;
IF call-gate code-segment descriptor does not indicate a code segment
or call-gate code-segment descriptor DPL > CPL
THEN #GP(call-gate code-segment selector); FI;
IF IA32_EFER.LMA = 1 AND (call-gate code-segment descriptor is
not a 64-bit code segment or call-gate code-segment descriptor has both L-bit and D-bit set)
THEN #GP(call-gate code-segment selector); FI;
IF call-gate code segment not present
THEN #NP(call-gate code-segment selector); FI;
IF call-gate code segment is non-conforming and DPL < CPL
THEN go to MORE-PRIVILEGE;
ELSE go to SAME-PRIVILEGE;
FI;
END;
MORE-PRIVILEGE:
IF current TSS is 32-bit
THEN
TSSstackAddress ← (new code-segment DPL ∗ 8) + 4;
IF (TSSstackAddress + 5) > current TSS limit
THEN #TS(current TSS selector); FI;
NewSS ← 2 bytes loaded from (TSS base + TSSstackAddress + 4);
NewESP ← 4 bytes loaded from (TSS base + TSSstackAddress);
ELSE
IF current TSS is 16-bit
THEN
TSSstackAddress ← (new code-segment DPL ∗ 4) + 2
IF (TSSstackAddress + 3) > current TSS limit
THEN #TS(current TSS selector); FI;
NewSS ← 2 bytes loaded from (TSS base + TSSstackAddress + 2);
NewESP ← 2 bytes loaded from (TSS base + TSSstackAddress);
ELSE (* current TSS is 64-bit *)
TSSstackAddress ← (new code-segment DPL ∗ 8) + 4;
IF (TSSstackAddress + 7) > current TSS limit
THEN #TS(current TSS selector); FI;
NewSS ← new code-segment DPL; (* NULL selector with RPL = new CPL *)
NewRSP ← 8 bytes loaded from (current TSS base + TSSstackAddress);
FI;
FI;
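The stack-pointer lookup in the TSS above depends on the TSS format. A hedged C sketch (hypothetical `tss_stack_address`) of the three offset formulas:

```c
#include <stdint.h>

/* Offset of the privilege-level stack pointer inside a TSS, per the
 * MORE-PRIVILEGE pseudocode above.  Function and enum names are
 * illustrative, not architectural. */
enum tss_kind { TSS16, TSS32, TSS64 };

static uint32_t tss_stack_address(enum tss_kind kind, uint32_t dpl) {
    switch (kind) {
    case TSS16: return dpl * 4 + 2;   /* 16-bit TSS: SS:SP pairs, 4 bytes each */
    case TSS32: return dpl * 8 + 4;   /* 32-bit TSS: SS:ESP pairs, 8 bytes each */
    default:    return dpl * 8 + 4;   /* 64-bit TSS: RSP0..RSP2, 8 bytes each */
    }
}
```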
IF IA32_EFER.LMA = 0 and NewSS is NULL
THEN #TS(NewSS); FI;
Read new stack-segment descriptor;
IF IA32_EFER.LMA = 0 and (NewSS RPL ≠ new code-segment DPL
or new stack-segment DPL ≠ new code-segment DPL or new stack segment is not a
writable data segment)
THEN #TS(NewSS); FI
IF IA32_EFER.LMA = 0 and new stack segment not present
THEN #SS(NewSS); FI;
IF CallGateSize = 32
THEN
IF new stack does not have room for parameters plus 16 bytes
THEN #SS(NewSS); FI;
IF CallGate(InstructionPointer) not within new code-segment limit
THEN #GP(0); FI;
SS ← newSS; (* Segment descriptor information also loaded *)
ESP ← newESP;
CS:EIP ← CallGate(CS:InstructionPointer);
(* Segment descriptor information also loaded *)
Push(oldSS:oldESP); (* From calling procedure *)
temp ← parameter count from call gate, masked to 5 bits;
Push(parameters from calling procedure’s stack, temp)
Push(oldCS:oldEIP); (* Return address to calling procedure *)
ELSE
IF CallGateSize = 16
THEN
IF new stack does not have room for parameters plus 8 bytes
THEN #SS(NewSS); FI;
IF (CallGate(InstructionPointer) AND FFFFH) not in new code-segment limit
THEN #GP(0); FI;
SS ← newSS; (* Segment descriptor information also loaded *)
ESP ← newESP;
CS:IP ← CallGate(CS:InstructionPointer);
(* Segment descriptor information also loaded *)
Push(oldSS:oldESP); (* From calling procedure *)
temp ← parameter count from call gate, masked to 5 bits;
Push(parameters from calling procedure’s stack, temp)
Push(oldCS:oldEIP); (* Return address to calling procedure *)
ELSE (* CallGateSize = 64 *)
IF pushing 32 bytes on the stack would use a non-canonical address
THEN #SS(NewSS); FI;
IF (CallGate(InstructionPointer) is non-canonical)
THEN #GP(0); FI;
SS ← NewSS; (* NewSS is NULL *)
RSP ← NewRSP;
CS:RIP ← CallGate(CS:InstructionPointer);
(* Segment descriptor information also loaded *)
Push(oldSS:oldESP); (* From calling procedure *)
Push(oldCS:oldEIP); (* Return address to calling procedure *)
FI;
FI;
IF ShadowStackEnabled(CPL)
THEN
IF CPL = 3
THEN IA32_PL3_SSP ← SSP; FI;
FI;
CPL ← CodeSegment(DPL)
CS(RPL) ← CPL
IF ShadowStackEnabled(CPL)
oldSSP ← SSP
SSP ← IA32_PLi_SSP; (* where i is the CPL *)
IF SSP & 0x07 != 0 (* if SSP not aligned to 8 bytes then #GP *)
THEN #GP(0); FI;
IF ((EFER.LMA and CS.L) = 0 AND SSP[63:32] != 0)
THEN #GP(0); FI;
expected_token_value = SSP (* busy bit - bit position 0 - must be clear *)
new_token_value = SSP | BUSY_BIT (* Set the busy bit *)
IF shadow_stack_lock_cmpxchg8b(SSP, new_token_value, expected_token_value) != expected_token_value
THEN #GP(0); FI;
IF oldSS.DPL != 3
ShadowStackPush8B(oldCS); (* Padded with 48 high-order bits of 0 *)
ShadowStackPush8B(oldCSBASE+oldRIP); (* Padded with 32 high-order bits of 0 for 32 bit LIP*)
ShadowStackPush8B(oldSSP);
FI;
FI;
IF EndbranchEnabled (CPL)
IA32_S_CET.TRACKER = WAIT_FOR_ENDBRANCH
IA32_S_CET.SUPPRESS = 0
FI;
END;
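The supervisor shadow-stack token check in MORE-PRIVILEGE above (expected value, busy bit, locked compare-exchange) can be modeled with a C11 atomic. Names are hypothetical; the real operation acts on shadow-stack memory with an 8-byte locked cmpxchg.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdatomic.h>

#define BUSY_BIT 1ULL

/* Hedged model: the 8-byte token at the new SSP must equal the SSP
 * itself with the busy bit (bit 0) clear; the check and the setting
 * of the busy bit must be a single atomic operation. */
static bool claim_shadow_stack_token(_Atomic uint64_t *token, uint64_t ssp) {
    uint64_t expected = ssp;             /* token must equal SSP, busy bit clear */
    uint64_t desired  = ssp | BUSY_BIT;  /* set the busy bit atomically */
    return atomic_compare_exchange_strong(token, &expected, desired);
}
```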
SAME-PRIVILEGE:
IF CallGateSize = 32
THEN
IF stack does not have room for 8 bytes
THEN #SS(0); FI;
IF CallGate(InstructionPointer) not within code segment limit
THEN #GP(0); FI;
CS:EIP ← CallGate(CS:EIP) (* Segment descriptor information also loaded *)
Push(oldCS:oldEIP); (* Return address to calling procedure *)
ELSE
IF CallGateSize = 16
THEN
IF stack does not have room for 4 bytes
THEN #SS(0); FI;
IF CallGate(InstructionPointer) not within code segment limit
THEN #GP(0); FI;
CS:IP ← CallGate(CS:instruction pointer);
(* Segment descriptor information also loaded *)
Push(oldCS:oldIP); (* Return address to calling procedure *)
ELSE (* CallGateSize = 64 *)
IF pushing 16 bytes on the stack touches non-canonical addresses
THEN #SS(0); FI;
IF RIP non-canonical
THEN #GP(0); FI;
CS:RIP ← CallGate(CS:InstructionPointer);
(* Segment descriptor information also loaded *)
Push(oldCS:oldRIP); (* Return address to calling procedure *)
FI;
FI;
CS(RPL) ← CPL
IF ShadowStackEnabled(CPL)
(* Align to next 8-byte boundary *)
tempSSP = SSP;
Shadow_stack_store 4 bytes of 0 to (SSP – 4)
SSP = SSP & 0xFFFFFFFFFFFFFFF8
(* Push CS:LIP:SSP on the shadow stack *)
ShadowStackPush8B(oldCS); (* Padded with 48 high-order bits of 0 *)
ShadowStackPush8B(oldCSBASE + oldRIP); (* Padded with 32 high-order bits of 0 for 32-bit LIP *)
ShadowStackPush8B(tempSSP);
FI;
IF EndbranchEnabled(CPL)
IF CPL = 3
THEN
IA32_U_CET.TRACKER = WAIT_FOR_ENDBRANCH
IA32_U_CET.SUPPRESS = 0
ELSE
IA32_S_CET.TRACKER = WAIT_FOR_ENDBRANCH
IA32_S_CET.SUPPRESS = 0
FI;
FI;
END;
TASK-GATE:
IF task gate DPL < CPL or RPL
THEN #GP(task gate selector); FI;
IF task gate not present
THEN #NP(task gate selector); FI;
Read the TSS segment selector in the task-gate descriptor;
IF TSS segment selector local/global bit is set to local
or index not within GDT limits
THEN #GP(TSS selector); FI;
Access TSS descriptor in GDT;
IF descriptor is not a TSS segment
THEN #GP(TSS selector); FI;
IF TSS descriptor specifies that the TSS is busy
THEN #GP(TSS selector); FI;
IF TSS not present
THEN #NP(TSS selector); FI;
SWITCH-TASKS (with nesting) to TSS;
IF EIP not within code segment limit
THEN #GP(0); FI;
END;
TASK-STATE-SEGMENT:
IF TSS DPL < CPL or RPL
or TSS descriptor indicates TSS not available
THEN #GP(TSS selector); FI;
IF TSS is not present
THEN #NP(TSS selector); FI;
SWITCH-TASKS (with nesting) to TSS;
IF EIP not within code segment limit
THEN #GP(0); FI;
END;
Flags Affected
All flags are affected if a task switch occurs; no flags are affected if a task switch does not occur.
If the code segment descriptor pointed to by the selector in the 64-bit gate doesn't have the L-
bit set and the D-bit clear.
If the segment descriptor for a segment selector from the 64-bit call gate does not indicate it
is a code segment.
#SS(0) If pushing the return offset or CS selector onto the stack exceeds the bounds of the stack
segment when no stack switch occurs.
If a memory operand effective address is outside the SS segment limit.
If the stack address is in a non-canonical form.
#SS(selector) If pushing the old values of SS selector, stack pointer, EFLAGS, CS selector, offset, or error
code onto the stack violates the canonical boundary when a stack switch occurs.
#NP(selector) If a code segment or 64-bit call gate is not present.
#TS(selector) If the load of the new RSP exceeds the limit of the TSS.
#UD (64-bit mode only) If a far call is direct to an absolute address in memory.
If the LOCK prefix is used.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.
Description
Invalidates from every level of the cache hierarchy in the cache coherence domain the cache line that contains the
linear address specified with the memory operand. If that cache line contains modified data at any level of the
cache hierarchy, that data is written back to memory. The source operand is a byte memory location.
The availability of CLFLUSH is indicated by the presence of the CPUID feature flag CLFSH
(CPUID.01H:EDX[bit 19]). The aligned cache line size affected is also indicated with the CPUID instruction (bits 8
through 15 of the EBX register when the initial value in the EAX register is 1).
The memory attribute of the page containing the affected line has no effect on the behavior of this instruction. It
should be noted that processors are free to speculatively fetch and cache data from system memory regions
assigned a memory-type allowing for speculative reads (such as the WB, WC, and WT memory types). PREFETCHh
instructions can be used to provide the processor with hints for this speculative behavior. Because this speculative
fetching can occur at any time and is not tied to instruction execution, the CLFLUSH instruction is not ordered with
respect to PREFETCHh instructions or any of the speculative fetching mechanisms (that is, data can be specula-
tively loaded into a cache line just before, during, or after the execution of a CLFLUSH instruction that references
the cache line).
Executions of the CLFLUSH instruction are ordered with respect to each other and with respect to writes, locked
read-modify-write instructions, and fence instructions.1 They are not ordered with respect to executions of
CLFLUSHOPT and CLWB. Software can use the SFENCE instruction to order an execution of CLFLUSH relative to one
of those operations.
The CLFLUSH instruction can be used at all privilege levels and is subject to all permission checking and faults asso-
ciated with a byte load (and in addition, a CLFLUSH instruction is allowed to flush a linear address in an execute-
only segment). Like a load, the CLFLUSH instruction sets the A bit but not the D bit in the page tables.
In some implementations, the CLFLUSH instruction may always cause transactional abort with Transactional
Synchronization Extensions (TSX). The CLFLUSH instruction is not expected to be commonly used inside typical
transactional regions. However, programmers must not rely on the CLFLUSH instruction to force a transactional
abort, since whether it causes a transactional abort is implementation dependent.
The CLFLUSH instruction was introduced with the SSE2 extensions; however, because it has its own CPUID feature
flag, it can be implemented in IA-32 processors that do not include the SSE2 extensions. Also, detecting the pres-
ence of the SSE2 extensions with the CPUID instruction does not guarantee that the CLFLUSH instruction is imple-
mented in the processor.
CLFLUSH operation is the same in non-64-bit modes and 64-bit mode.
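Assuming leaf-01H register values are already in hand, the availability and line-size decoding described above reduces to two bit-field extractions. A portable sketch (hypothetical helper names; on x86 the inputs would come from executing CPUID):

```c
#include <stdint.h>

/* CLFSH support is CPUID.01H:EDX[bit 19]; EBX bits 15:8 of the same
 * leaf give the cache-line size in 8-byte units.  The register values
 * are passed in as parameters so the logic stays portable. */
static int clflush_supported(uint32_t edx) {
    return (edx >> 19) & 1;
}

static uint32_t clflush_line_size(uint32_t ebx) {
    return ((ebx >> 8) & 0xFF) * 8;   /* bytes */
}
```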
Operation
Flush_Cache_Line(SRC);
1. Earlier versions of this manual specified that executions of the CLFLUSH instruction were ordered only by the MFENCE instruction.
All processors implementing the CLFLUSH instruction also order it relative to the other operations enumerated above.
Description
Invalidates from every level of the cache hierarchy in the cache coherence domain the cache line that contains the
linear address specified with the memory operand. If that cache line contains modified data at any level of the
cache hierarchy, that data is written back to memory. The source operand is a byte memory location.
The availability of CLFLUSHOPT is indicated by the presence of the CPUID feature flag CLFLUSHOPT
(CPUID.(EAX=7,ECX=0):EBX[bit 23]). The aligned cache line size affected is also indicated with the CPUID instruc-
tion (bits 8 through 15 of the EBX register when the initial value in the EAX register is 1).
The memory attribute of the page containing the affected line has no effect on the behavior of this instruction. It
should be noted that processors are free to speculatively fetch and cache data from system memory regions
assigned a memory-type allowing for speculative reads (such as the WB, WC, and WT memory types). PREFETCHh
instructions can be used to provide the processor with hints for this speculative behavior. Because this speculative
fetching can occur at any time and is not tied to instruction execution, the CLFLUSHOPT instruction is not ordered
with respect to PREFETCHh instructions or any of the speculative fetching mechanisms (that is, data can be
speculatively loaded into a cache line just before, during, or after the execution of a CLFLUSHOPT instruction that
references the cache line).
Executions of the CLFLUSHOPT instruction are ordered with respect to fence instructions and to locked read-
modify-write instructions; they are also ordered with respect to older writes to the cache line being invalidated.
They are not ordered with respect to other executions of CLFLUSHOPT, to executions of CLFLUSH and CLWB, or to
younger writes to the cache line being invalidated. Software can use the SFENCE instruction to order an execution
of CLFLUSHOPT relative to one of those operations.
The CLFLUSHOPT instruction can be used at all privilege levels and is subject to all permission checking and faults
associated with a byte load (and in addition, a CLFLUSHOPT instruction is allowed to flush a linear address in an
execute-only segment). Like a load, the CLFLUSHOPT instruction sets the A bit but not the D bit in the page tables.
In some implementations, the CLFLUSHOPT instruction may always cause transactional abort with Transactional
Synchronization Extensions (TSX). The CLFLUSHOPT instruction is not expected to be commonly used inside
typical transactional regions. However, programmers must not rely on the CLFLUSHOPT instruction to force a
transactional abort, since whether it causes a transactional abort is implementation dependent.
CLFLUSHOPT operation is the same in non-64-bit modes and 64-bit mode.
Operation
Flush_Cache_Line_Optimized(SRC);
Description
Clears the busy flag in the supervisor shadow-stack token referenced by m64. Subsequent to marking the shadow
stack as not busy, the SSP is loaded with the value 0.
Operation
IF (CR4.CET = 0)
THEN #UD; FI;
IF (IA32_S_CET.SH_STK_EN = 0)
THEN #UD; FI;
IF CPL > 0
THEN #GP(0); FI;
Flags Affected
CF is set if an invalid token was detected, else it is cleared. ZF, PF, AF, OF, and SF are cleared.
Description
Writes back to memory the cache line (if modified) that contains the linear address specified with the memory
operand from any level of the cache hierarchy in the cache coherence domain. The line may be retained in the
cache hierarchy in non-modified state. Retaining the line in the cache hierarchy is a performance optimization
(treated as a hint by hardware) to reduce the possibility of cache miss on a subsequent access. Hardware may
choose to retain the line at any of the levels in the cache hierarchy, and in some cases, may invalidate the line from
the cache hierarchy. The source operand is a byte memory location.
The availability of CLWB instruction is indicated by the presence of the CPUID feature flag CLWB (bit 24 of the EBX
register, see “CPUID — CPU Identification” in this chapter). The aligned cache line size affected is also indicated
with the CPUID instruction (bits 8 through 15 of the EBX register when the initial value in the EAX register is 1).
The memory attribute of the page containing the affected line has no effect on the behavior of this instruction. It
should be noted that processors are free to speculatively fetch and cache data from system memory regions that
are assigned a memory-type allowing for speculative reads (such as the WB, WC, and WT memory types).
PREFETCHh instructions can be used to provide the processor with hints for this speculative behavior. Because this
speculative fetching can occur at any time and is not tied to instruction execution, the CLWB instruction is not
ordered with respect to PREFETCHh instructions or any of the speculative fetching mechanisms (that is, data can
be speculatively loaded into a cache line just before, during, or after the execution of a CLWB instruction that refer-
ences the cache line).
Executions of the CLWB instruction are ordered with respect to fence instructions and to locked read-modify-write
instructions; they are also ordered with respect to older writes to the cache line being written back. They are not
ordered with respect to other executions of CLWB, to executions of CLFLUSH and CLFLUSHOPT, or to younger
writes to the cache line being written back. Software can use the SFENCE instruction to order an execution of CLWB
relative to one of those operations.
For usages that require only writing back modified data from cache lines to memory (do not require the line to be
invalidated), and expect to subsequently access the data, software is recommended to use CLWB (with appropriate
fencing) instead of CLFLUSH or CLFLUSHOPT for improved performance.
The CLWB instruction can be used at all privilege levels and is subject to all permission checking and faults associ-
ated with a byte load. Like a load, the CLWB instruction sets the accessed flag but not the dirty flag in the page
tables.
In some implementations, the CLWB instruction may always cause transactional abort with Transactional Synchro-
nization Extensions (TSX). The CLWB instruction is not expected to be commonly used inside typical transactional
regions. However, programmers must not rely on the CLWB instruction to force a transactional abort, since whether
it causes a transactional abort is implementation dependent.
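The recommended usage pattern — writing back a modified buffer line by line, followed by a fence — can be sketched portably. `flush_line` is a hypothetical stand-in for a CLWB (or CLFLUSHOPT fallback) wrapper, and the 64-byte line size is an assumption that real code should take from CPUID:

```c
#include <stdint.h>
#include <stddef.h>

#define LINE 64u   /* assumed cache-line size; query CPUID in real code */

/* Visit every cache line overlapping [buf, buf+len): round the start
 * down to a line boundary, then step by LINE.  An SFENCE would follow
 * the loop before any ordering-dependent store. */
static size_t for_each_line(const void *buf, size_t len,
                            void (*flush_line)(const void *)) {
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(LINE - 1);
    uintptr_t end = (uintptr_t)buf + len;
    size_t n = 0;
    for (; p < end; p += LINE, n++)
        flush_line((const void *)p);
    return n;   /* number of lines visited */
}

/* No-op callback for dry runs/testing; a real one would issue CLWB. */
static void flush_noop(const void *p) { (void)p; }
```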
Operation
Cache_Line_Write_Back(m8);
1. The Mod field of the ModR/M byte cannot have value 11B.
Flags Affected
None.
CMOVcc—Conditional Move
Opcode Instruction Op/ 64-Bit Compat/ Description
En Mode Leg Mode
0F 47 /r CMOVA r16, r/m16 RM Valid Valid Move if above (CF=0 and ZF=0).
0F 47 /r CMOVA r32, r/m32 RM Valid Valid Move if above (CF=0 and ZF=0).
REX.W + 0F 47 /r CMOVA r64, r/m64 RM Valid N.E. Move if above (CF=0 and ZF=0).
0F 43 /r CMOVAE r16, r/m16 RM Valid Valid Move if above or equal (CF=0).
0F 43 /r CMOVAE r32, r/m32 RM Valid Valid Move if above or equal (CF=0).
REX.W + 0F 43 /r CMOVAE r64, r/m64 RM Valid N.E. Move if above or equal (CF=0).
0F 42 /r CMOVB r16, r/m16 RM Valid Valid Move if below (CF=1).
0F 42 /r CMOVB r32, r/m32 RM Valid Valid Move if below (CF=1).
REX.W + 0F 42 /r CMOVB r64, r/m64 RM Valid N.E. Move if below (CF=1).
0F 46 /r CMOVBE r16, r/m16 RM Valid Valid Move if below or equal (CF=1 or ZF=1).
0F 46 /r CMOVBE r32, r/m32 RM Valid Valid Move if below or equal (CF=1 or ZF=1).
REX.W + 0F 46 /r CMOVBE r64, r/m64 RM Valid N.E. Move if below or equal (CF=1 or ZF=1).
0F 42 /r CMOVC r16, r/m16 RM Valid Valid Move if carry (CF=1).
0F 42 /r CMOVC r32, r/m32 RM Valid Valid Move if carry (CF=1).
REX.W + 0F 42 /r CMOVC r64, r/m64 RM Valid N.E. Move if carry (CF=1).
0F 44 /r CMOVE r16, r/m16 RM Valid Valid Move if equal (ZF=1).
0F 44 /r CMOVE r32, r/m32 RM Valid Valid Move if equal (ZF=1).
REX.W + 0F 44 /r CMOVE r64, r/m64 RM Valid N.E. Move if equal (ZF=1).
0F 4F /r CMOVG r16, r/m16 RM Valid Valid Move if greater (ZF=0 and SF=OF).
0F 4F /r CMOVG r32, r/m32 RM Valid Valid Move if greater (ZF=0 and SF=OF).
REX.W + 0F 4F /r CMOVG r64, r/m64 RM V/N.E. NA Move if greater (ZF=0 and SF=OF).
0F 4D /r CMOVGE r16, r/m16 RM Valid Valid Move if greater or equal (SF=OF).
0F 4D /r CMOVGE r32, r/m32 RM Valid Valid Move if greater or equal (SF=OF).
REX.W + 0F 4D /r CMOVGE r64, r/m64 RM Valid N.E. Move if greater or equal (SF=OF).
0F 4C /r CMOVL r16, r/m16 RM Valid Valid Move if less (SF≠ OF).
0F 4C /r CMOVL r32, r/m32 RM Valid Valid Move if less (SF≠ OF).
REX.W + 0F 4C /r CMOVL r64, r/m64 RM Valid N.E. Move if less (SF≠ OF).
0F 4E /r CMOVLE r16, r/m16 RM Valid Valid Move if less or equal (ZF=1 or SF≠ OF).
0F 4E /r CMOVLE r32, r/m32 RM Valid Valid Move if less or equal (ZF=1 or SF≠ OF).
REX.W + 0F 4E /r CMOVLE r64, r/m64 RM Valid N.E. Move if less or equal (ZF=1 or SF≠ OF).
0F 46 /r CMOVNA r16, r/m16 RM Valid Valid Move if not above (CF=1 or ZF=1).
0F 46 /r CMOVNA r32, r/m32 RM Valid Valid Move if not above (CF=1 or ZF=1).
REX.W + 0F 46 /r CMOVNA r64, r/m64 RM Valid N.E. Move if not above (CF=1 or ZF=1).
0F 42 /r CMOVNAE r16, r/m16 RM Valid Valid Move if not above or equal (CF=1).
0F 42 /r CMOVNAE r32, r/m32 RM Valid Valid Move if not above or equal (CF=1).
REX.W + 0F 42 /r CMOVNAE r64, r/m64 RM Valid N.E. Move if not above or equal (CF=1).
0F 43 /r CMOVNB r16, r/m16 RM Valid Valid Move if not below (CF=0).
0F 43 /r CMOVNB r32, r/m32 RM Valid Valid Move if not below (CF=0).
REX.W + 0F 43 /r CMOVNB r64, r/m64 RM Valid N.E. Move if not below (CF=0).
0F 47 /r CMOVNBE r16, r/m16 RM Valid Valid Move if not below or equal (CF=0 and ZF=0).
Description
Each of the CMOVcc instructions performs a move operation if the status flags in the EFLAGS register (CF, OF, PF,
SF, and ZF) are in a specified state (or condition). A condition code (cc) is associated with each instruction to indi-
cate the condition being tested for. If the condition is not satisfied, a move is not performed and execution
continues with the instruction following the CMOVcc instruction.
Specifically, CMOVcc loads data from its source operand into a temporary register unconditionally (regardless of
the condition code and the status flags in the EFLAGS register). If the condition code associated with the instruction
(cc) is satisfied, the data in the temporary register is then copied into the instruction's destination operand.
These instructions can move 16-bit, 32-bit or 64-bit values from memory to a general-purpose register or from one
general-purpose register to another. Conditional moves of 8-bit register operands are not supported.
The condition for each CMOVcc mnemonic is given in the description column of the above table. The terms “less”
and “greater” are used for comparisons of signed integers and the terms “above” and “below” are used for
unsigned integers.
Because a particular state of the status flags can sometimes be interpreted in two ways, two mnemonics are
defined for some opcodes. For example, the CMOVA (conditional move if above) instruction and the CMOVNBE
(conditional move if not below or equal) instruction are alternate mnemonics for the opcode 0F 47H.
The CMOVcc instructions were introduced in P6 family processors; however, these instructions may not be
supported by all IA-32 processors. Software can determine if the CMOVcc instructions are supported by checking
the processor’s feature information with the CPUID instruction (see “CPUID—CPU Identification” in this chapter).
In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.R prefix permits access to addi-
tional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the
beginning of this section for encoding data and limits.
Operation
temp ← SRC
IF condition TRUE
THEN DEST ← temp;
ELSE IF (OperandSize = 32 and IA-32e mode active)
THEN DEST[63:32] ← 0;
FI;
Flags Affected
None.
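The Operation section above has one subtlety worth a sketch: in 64-bit mode, a 32-bit-operand CMOVcc clears the upper 32 bits of the destination even when the condition is false. A hedged C model (hypothetical function name):

```c
#include <stdint.h>

/* Model of CMOVcc with a 32-bit operand in 64-bit mode: the source is
 * always read into a temporary, and the upper 32 bits of the 64-bit
 * destination are zeroed whether or not the move happens. */
static uint64_t cmov32_64bit_mode(int cond, uint64_t dest, uint32_t src) {
    uint32_t temp = src;                /* unconditional load of the source */
    if (cond)
        return (uint64_t)temp;          /* 32-bit write zero-extends */
    return dest & 0xFFFFFFFFu;          /* condition false: upper bits still cleared */
}
```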
Description
Performs a SIMD compare of the packed single-precision floating-point values in the second source operand and
the first source operand and returns the results of the comparison to the destination operand. The comparison
predicate operand (immediate byte) specifies the type of comparison performed on each of the pairs of packed
values.
EVEX encoded versions: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand (first operand) is an opmask register.
Comparison results are written to the destination operand under the writemask k2. Each comparison result is a
single mask bit of 1 (comparison true) or 0 (comparison false).
VEX.256 encoded version: The first source operand (second operand) is a YMM register. The second source operand
(third operand) can be a YMM register or a 256-bit memory location. The destination operand (first operand) is a
YMM register. Eight comparisons are performed with results written to the destination operand. The result of each
comparison is a doubleword mask of all 1s (comparison true) or all 0s (comparison false).
128-bit Legacy SSE version: The first source and destination operand (first operand) is an XMM register. The
second source operand (second operand) can be an XMM register or 128-bit memory location. Bits (MAXVL-1:128)
of the corresponding ZMM destination register remain unchanged. Four comparisons are performed with results
written to bits 127:0 of the destination operand. The result of each comparison is a doubleword mask of all 1s
(comparison true) or all 0s (comparison false).
VEX.128 encoded version: The first source operand (second operand) is an XMM register. The second source
operand (third operand) can be an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the destina-
tion ZMM register are zeroed. Four comparisons are performed with results written to bits 127:0 of the destination
operand.
The comparison predicate operand is an 8-bit immediate:
• For instructions encoded using the VEX prefix and EVEX prefix, bits 4:0 define the type of comparison to be
performed (see Table 3-1). Bits 5 through 7 of the immediate are reserved.
• For instruction encodings that do not use VEX prefix, bits 2:0 define the type of comparison to be made (see
the first 8 rows of Table 3-1). Bits 3 through 7 of the immediate are reserved.
The unordered relationship is true when at least one of the two source operands being compared is a NaN; the
ordered relationship is true when neither source operand is a NaN.
A subsequent computational instruction that uses the mask result in the destination operand as an input operand
will not generate an exception, because a mask of all 0s corresponds to a floating-point value of +0.0 and a mask
of all 1s corresponds to a QNaN.
Note that processors with “CPUID.1H:ECX.AVX = 0” do not implement the “greater-than”, “greater-than-or-equal”,
“not-greater-than”, and “not-greater-than-or-equal” predicates. These comparisons can be made either
by using the inverse relationship (that is, use the “not-less-than-or-equal” to make a “greater-than” comparison)
or by using software emulation. When using software emulation, the program must swap the operands (copying
registers when necessary to protect the data that will now be in the destination), and then perform the compare
using a different predicate. The predicate to be used for these emulations is listed in the first 8 rows of Table 3-7
(Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A) under the heading Emulation.
Compilers and assemblers may implement the following two-operand pseudo-ops in addition to the three-operand
CMPPS instruction, for processors with “CPUID.1H:ECX.AVX = 0”. See Table 3-4. Compilers should treat reserved
Imm8 values as illegal syntax.
Table 3-4. Pseudo-Op and CMPPS Implementation
:
The greater-than relations that the processor does not implement require more than one instruction to emulate in
software and therefore should not be implemented as pseudo-ops. (For these, the programmer should reverse the
operands of the corresponding less than relations and use move instructions to ensure that the mask is moved to
the correct destination register and that the source operand is left intact.)
Processors with “CPUID.1H:ECX.AVX = 1” implement the full complement of 32 predicates shown in Table 3-5, and
software emulation is no longer needed. Compilers and assemblers may implement the following three-operand
pseudo-ops in addition to the four-operand VCMPPS instruction. See Table 3-5, where reg1 and reg2 represent
either XMM registers or YMM registers. Compilers should treat reserved Imm8 values as illegal syntax. Alternately,
intrinsics can map the pseudo-ops to pre-defined constants to support a simpler intrinsic interface. Compilers and
assemblers may implement three-operand pseudo-ops for EVEX-encoded VCMPPS instructions in a similar fashion
by extending the syntax listed in Table 3-5.
:
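A few of the predicates above can be modeled in portable C, showing the ordered/unordered NaN behavior; each lane result is an all-1s or all-0s 32-bit mask. Function names are hypothetical, and signaling (QNaN vs. SNaN) behavior is not modeled:

```c
#include <stdint.h>
#include <math.h>

/* EQ_OQ (ordered) is false if either operand is NaN. */
static uint32_t cmp_eq_oq(float a, float b)   { return (a == b) ? 0xFFFFFFFFu : 0; }

/* UNORD_Q is true iff at least one operand is NaN. */
static uint32_t cmp_unord_q(float a, float b) { return (isnan(a) || isnan(b)) ? 0xFFFFFFFFu : 0; }

/* NLT_US (not-less-than, unordered) is true on NaN, since a < b is
 * false for any unordered pair. */
static uint32_t cmp_nlt_us(float a, float b)  { return !(a < b) ? 0xFFFFFFFFu : 0; }
```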
Operation
CASE (COMPARISON PREDICATE) OF
0: OP3 ← EQ_OQ; OP5 ← EQ_OQ;
1: OP3 ← LT_OS; OP5 ← LT_OS;
2: OP3 ← LE_OS; OP5 ← LE_OS;
3: OP3 ← UNORD_Q; OP5 ← UNORD_Q;
4: OP3 ← NEQ_UQ; OP5 ← NEQ_UQ;
5: OP3 ← NLT_US; OP5 ← NLT_US;
6: OP3 ← NLE_US; OP5 ← NLE_US;
7: OP3 ← ORD_Q; OP5 ← ORD_Q;
8: OP5 ← EQ_UQ;
9: OP5 ← NGE_US;
10: OP5 ← NGT_US;
11: OP5 ← FALSE_OQ;
12: OP5 ← NEQ_OQ;
13: OP5 ← GE_OS;
14: OP5 ← GT_OS;
15: OP5 ← TRUE_UQ;
16: OP5 ← EQ_OS;
17: OP5 ← LT_OQ;
18: OP5 ← LE_OQ;
19: OP5 ← UNORD_S;
20: OP5 ← NEQ_US;
21: OP5 ← NLT_UQ;
22: OP5 ← NLE_UQ;
23: OP5 ← ORD_S;
24: OP5 ← EQ_US;
25: OP5 ← NGE_UQ;
26: OP5 ← NGT_UQ;
27: OP5 ← FALSE_OS;
28: OP5 ← NEQ_OS;
29: OP5 ← GE_OQ;
30: OP5 ← GT_OQ;
31: OP5 ← TRUE_US;
ESAC;
Other Exceptions
VEX-encoded instructions, see Exceptions Type 2.
EVEX-encoded instructions, see Exceptions Type E2.
CPUID—CPU Identification
Opcode Instruction Op/ 64-Bit Compat/ Description
En Mode Leg Mode
0F A2 CPUID ZO Valid Valid Returns processor identification and feature
information to the EAX, EBX, ECX, and EDX
registers, as determined by input entered in
EAX (in some cases, ECX as well).
Description
The ID flag (bit 21) in the EFLAGS register indicates support for the CPUID instruction. If a software procedure can
set and clear this flag, the processor executing the procedure supports the CPUID instruction. This instruction oper-
ates the same in non-64-bit modes and 64-bit mode.
CPUID returns processor identification and feature information in the EAX, EBX, ECX, and EDX registers.1 The
instruction’s output is dependent on the contents of the EAX register upon execution (in some cases, ECX as well).
For example, the following pseudocode loads EAX with 00H and causes CPUID to return a Maximum Return Value
and the Vendor Identification String in the appropriate registers:
1. On Intel 64 processors, CPUID clears the high 32 bits of the RAX/RBX/RCX/RDX registers in all modes.
2. CPUID leaf 1FH is a preferred superset to leaf 0BH. Intel recommends first checking for the existence of CPUID leaf 1FH before using
leaf 0BH.
EAX Bits 31 - 00: Reports the maximum input value for supported leaf 7 sub-leaves.
EBX Bit 00: FSGSBASE. Supports RDFSBASE/RDGSBASE/WRFSBASE/WRGSBASE if 1.
Bit 01: IA32_TSC_ADJUST MSR is supported if 1.
Bit 02: SGX. Supports Intel® Software Guard Extensions (Intel® SGX Extensions) if 1.
Bit 03: BMI1.
Bit 04: HLE.
Bit 05: AVX2.
Bit 06: FDP_EXCPTN_ONLY. x87 FPU Data Pointer updated only on x87 exceptions if 1.
Bit 07: SMEP. Supports Supervisor-Mode Execution Prevention if 1.
Bit 08: BMI2.
Bit 09: Supports Enhanced REP MOVSB/STOSB if 1.
Bit 10: INVPCID. If 1, supports INVPCID instruction for system software that manages process-context
identifiers.
Bit 11: RTM.
Bit 12: RDT-M. Supports Intel® Resource Director Technology (Intel® RDT) Monitoring capability if 1.
Bit 13: Deprecates FPU CS and FPU DS values if 1.
Bit 14: MPX. Supports Intel® Memory Protection Extensions if 1.
Bit 15: RDT-A. Supports Intel® Resource Director Technology (Intel® RDT) Allocation capability if 1.
Bit 16: AVX512F.
Bit 17: AVX512DQ.
Bit 18: RDSEED.
Bit 19: ADX.
Bit 20: SMAP. Supports Supervisor-Mode Access Prevention (and the CLAC/STAC instructions) if 1.
Bit 21: AVX512_IFMA.
Bit 22: Reserved.
Bit 23: CLFLUSHOPT.
Bit 24: CLWB.
Bit 25: Intel Processor Trace.
Bit 26: AVX512PF. (Intel® Xeon Phi™ only.)
Bit 27: AVX512ER. (Intel® Xeon Phi™ only.)
Bit 28: AVX512CD.
Bit 29: SHA. supports Intel® Secure Hash Algorithm Extensions (Intel® SHA Extensions) if 1.
Bit 30: AVX512BW.
Bit 31: AVX512VL.
EAX Bits 04 - 00: Number of bits to shift right on x2APIC ID to get a unique topology ID of the next level type*.
All logical processors with the same next level ID share current level.
Bits 31 - 05: Reserved.
EBX Bits 15 - 00: Number of logical processors at this level type. The number reflects configuration as shipped
by Intel**.
Bits 31- 16: Reserved.
*** The value of the “level type” field is not related to level numbers in any way; higher “level type” values do not mean higher levels. The level type field has the following encoding:
0: Invalid.
1: SMT.
2: Core.
3-255: Reserved.
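As an illustrative sketch (not from the manual itself), the shift value reported in EAX[4:0] can be applied to an x2APIC ID to split it into per-level topology IDs; the shift widths below (1 for the SMT level, 5 for the core level) are hypothetical example values:

```python
def topology_ids(x2apic_id, smt_shift, core_shift):
    """Split an x2APIC ID into SMT, core, and package IDs using the
    per-level shift widths reported in CPUID leaf 0BH EAX[4:0]."""
    smt_id = x2apic_id & ((1 << smt_shift) - 1)           # bits below the SMT shift
    core_id = (x2apic_id >> smt_shift) & ((1 << (core_shift - smt_shift)) - 1)
    package_id = x2apic_id >> core_shift                  # bits above the core shift
    return smt_id, core_id, package_id

# Hypothetical shifts: SMT level shift = 1, core level shift = 5.
print(topology_ids(0b10111, 1, 5))  # → (1, 11, 0)
```

All logical processors whose x2APIC IDs agree above a given shift share that level of the topology.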
Processor Extended State Enumeration Main Leaf (EAX = 0DH, ECX = 0)
0DH NOTES:
Leaf 0DH main leaf (ECX = 0).
EAX Bits 31 - 00: Reports the supported bits of the lower 32 bits of XCR0. XCR0[n] can be set to 1 only if
EAX[n] is 1.
Bit 00: x87 state.
Bit 01: SSE state.
Bit 02: AVX state.
Bits 04 - 03: MPX state.
Bits 07 - 05: AVX-512 state.
Bit 08: Used for IA32_XSS.
Bit 09: PKRU state.
Bits 12 - 10: Reserved.
Bit 13: Used for IA32_XSS.
Bits 31 - 14: Reserved.
EBX Bits 31 - 00: Maximum size (bytes, from the beginning of the XSAVE/XRSTOR save area) required by
enabled features in XCR0. May be different than ECX if some features at the end of the XSAVE save area
are not enabled.
ECX Bits 31 - 00: Maximum size (bytes, from the beginning of the XSAVE/XRSTOR save area) of the
XSAVE/XRSTOR save area required by all supported features in the processor, i.e., all the valid bit fields in
XCR0.
EDX Bits 31 - 00: Reports the supported bits of the upper 32 bits of XCR0. XCR0[n+32] can be set to 1 only if
EDX[n] is 1.
Bits 31 - 00: Reserved.
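The EAX/EDX pair from this leaf forms a 64-bit mask of the bits software may set in XCR0. A minimal sketch (the register values below are hypothetical, not taken from the manual):

```python
def xcr0_supported_mask(eax, edx):
    """Combine CPUID.(EAX=0DH, ECX=0):EAX (low 32 bits) and :EDX
    (high 32 bits) into the 64-bit mask of settable XCR0 bits."""
    return (edx << 32) | eax

# Hypothetical output: x87 (bit 0), SSE (bit 1), and AVX (bit 2) state supported.
mask = xcr0_supported_mask(0x7, 0x0)
print(bool(mask & (1 << 2)))  # → True (AVX state can be enabled in XCR0)
```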
Processor Extended State Enumeration Sub-leaf (EAX = 0DH, ECX = 1)
0DH EAX Bit 00: XSAVEOPT is available.
Bit 01: Supports XSAVEC and the compacted form of XRSTOR if set.
Bit 02: Supports XGETBV with ECX = 1 if set.
Bit 03: Supports XSAVES/XRSTORS and IA32_XSS if set.
Bits 31 - 04: Reserved.
EBX Bits 31 - 00: The size in bytes of the XSAVE area containing all states enabled by XCR0 | IA32_XSS.
EBX[19:00]: Bits 51:32 of the physical address of the base of the EPC section.
EBX[31:20]: Reserved.
EDX[19:00]: Bits 51:32 of the size of the corresponding EPC section within the Processor Reserved
Memory.
EDX[31:20]: Reserved.
Intel Processor Trace Enumeration Main Leaf (EAX = 14H, ECX = 0)
14H NOTES:
Leaf 14H main leaf (ECX = 0).
EAX Bits 31 - 00: Reports the maximum sub-leaf supported in leaf 14H.
EBX Bit 00: If 1, indicates that IA32_RTIT_CTL.CR3Filter can be set to 1, and that IA32_RTIT_CR3_MATCH
MSR can be accessed.
Bit 01: If 1, indicates support of Configurable PSB and Cycle-Accurate Mode.
Bit 02: If 1, indicates support of IP Filtering, TraceStop filtering, and preservation of Intel PT MSRs across
warm reset.
Bit 03: If 1, indicates support of MTC timing packet and suppression of COFI-based packets.
Bit 04: If 1, indicates support of PTWRITE. Writes can set IA32_RTIT_CTL[12] (PTWEn) and
IA32_RTIT_CTL[5] (FUPonPTW), and PTWRITE can generate packets.
Bit 05: If 1, indicates support of Power Event Trace. Writes can set IA32_RTIT_CTL[4] (PwrEvtEn),
enabling Power Event Trace packet generation.
Bits 31 - 06: Reserved.
ECX Bit 00: If 1, Tracing can be enabled with IA32_RTIT_CTL.ToPA = 1, hence utilizing the ToPA output
scheme; IA32_RTIT_OUTPUT_BASE and IA32_RTIT_OUTPUT_MASK_PTRS MSRs can be accessed.
Bit 01: If 1, ToPA tables can hold any number of output entries, up to the maximum allowed by the
MaskOrTableOffset field of IA32_RTIT_OUTPUT_MASK_PTRS.
Bit 02: If 1, indicates support of Single-Range Output scheme.
Bit 03: If 1, indicates support of output to Trace Transport subsystem.
Bits 30 - 04: Reserved.
Bit 31: If 1, generated packets which contain IP payloads have LIP values, which include the CS base com-
ponent.
EDX Bits 31 - 00: Reserved.
While a processor may support the Processor Frequency Information leaf, fields that return a value of
zero are not supported.
System-On-Chip Vendor Attribute Enumeration Main Leaf (EAX = 17H, ECX = 0)
17H NOTES:
Leaf 17H main leaf (ECX = 0).
Leaf 17H output depends on the initial value in ECX.
Leaf 17H sub-leaves 1 through 3 report the SOC Vendor Brand String.
Leaf 17H is valid if MaxSOCID_Index >= 3.
Leaf 17H sub-leaves 4 and above are reserved.
System-On-Chip Vendor Attribute Enumeration Sub-leaves (EAX = 17H, ECX > MaxSOCID_Index)
17H NOTES:
Leaf 17H output depends on the initial value in ECX.
EAX Bits 31 - 00: Reports the maximum input value of supported sub-leaf in leaf 18H.
EAX Bits 04 - 00: Number of bits to shift right on x2APIC ID to get a unique topology ID of the next level type*.
All logical processors with the same next level ID share current level.
Bits 31 - 05: Reserved.
EBX Bits 15 - 00: Number of logical processors at this level type. The number reflects configuration as shipped
by Intel**.
Bits 31- 16: Reserved.
ECX Bits 07 - 00: Level number. Same value in ECX input.
Bits 15 - 08: Level type***.
Bits 31 - 16: Reserved.
EDX Bits 31 - 00: x2APIC ID of the current logical processor.
*** The value of the “level type” field is not related to level numbers in any way; higher “level type” values do not mean higher levels. The level type field has the following encoding:
0: Invalid.
1: SMT.
2: Core.
3: Module.
4: Tile.
5: Die.
6-255: Reserved.
Unimplemented CPUID Leaf Functions
40000000H - 4FFFFFFFH
Invalid. No existing or future CPU will return processor identification or feature information if the initial
EAX value is in the range 40000000H to 4FFFFFFFH.
Extended Function CPUID Information
80000000H EAX Maximum Input Value for Extended Function CPUID Information.
EBX Reserved.
ECX Reserved.
EDX Reserved.
80000001H EAX Extended Processor Signature and Feature Bits.
EBX Reserved.
ECX Bit 00: LAHF/SAHF available in 64-bit mode.*
Bits 04 - 01: Reserved.
Bit 05: LZCNT.
Bits 07 - 06: Reserved.
Bit 08: PREFETCHW.
Bits 31 - 09: Reserved.
EDX Bits 10 - 00: Reserved.
Bit 11: SYSCALL/SYSRET.**
Bits 19 - 12: Reserved = 0.
Bit 20: Execute Disable Bit available.
Bits 25 - 21: Reserved = 0.
Bit 26: 1-GByte pages are available if 1.
Bit 27: RDTSCP and IA32_TSC_AUX are available if 1.
Bit 28: Reserved = 0.
Bit 29: Intel® 64 Architecture available if 1.
Bits 31 - 30: Reserved = 0.
NOTES:
* LAHF and SAHF are always available in other modes, regardless of the enumeration of this feature flag.
** Intel processors support SYSCALL and SYSRET only in 64-bit mode. This feature flag is always enumer-
ated as 0 outside 64-bit mode.
** CPUID leaf 04H provides details of deterministic cache parameters, including the L2 cache in sub-leaf 2
80000007H EAX Reserved = 0.
EBX Reserved = 0.
ECX Reserved = 0.
EDX Bits 07 - 00: Reserved = 0.
Bit 08: Invariant TSC available if 1.
Bits 31 - 09: Reserved = 0.
80000008H EAX Linear/Physical Address size.
Bits 07 - 00: #Physical Address Bits*.
Bits 15 - 08: #Linear Address Bits.
Bits 31 - 16: Reserved = 0.
NOTES:
* If CPUID.80000008H:EAX[7:0] is supported, the maximum physical address width supported should
come from this field.
INPUT EAX = 0: Returns CPUID’s Highest Value for Basic Processor Information and the Vendor Identification String
When CPUID executes with EAX set to 0, the processor returns the highest value the CPUID recognizes for
returning basic processor information. The value is returned in the EAX register and is processor specific.
A vendor identification string is also returned in EBX, EDX, and ECX. For Intel processors, the string is “Genuin-
eIntel” and is expressed:
EBX ← 756e6547h (* “Genu”, with G in the low eight bits of BL *)
EDX ← 49656e69h (* “ineI”, with i in the low eight bits of DL *)
ECX ← 6c65746eh (* “ntel”, with n in the low eight bits of CL *)
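These register values can be decoded by concatenating the bytes of EBX, EDX, and ECX in little-endian order; a quick sketch:

```python
import struct

def vendor_string(ebx, edx, ecx):
    """Reassemble the 12-byte vendor ID from CPUID.0 register values
    (each register holds four ASCII bytes, least significant first)."""
    return struct.pack('<III', ebx, edx, ecx).decode('ascii')

print(vendor_string(0x756E6547, 0x49656E69, 0x6C65746E))  # → GenuineIntel
```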
INPUT EAX = 80000000H: Returns CPUID’s Highest Value for Extended Processor Information
When CPUID executes with EAX set to 80000000H, the processor returns the highest value the processor recog-
nizes for returning extended processor information. The value is returned in the EAX register and is processor
specific.
[Figure OM16525: Version information returned by CPUID in the EAX register: Extended Family ID (bits 27:20), Extended Model ID (19:16), Processor Type (13:12), Family ID (11:8), Model (7:4), Stepping ID (3:0); bits 31:28 and 15:14 are reserved.]
NOTE
See Chapter 19 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for
information on identifying earlier IA-32 processors.
The Extended Family ID needs to be examined only when the Family ID is 0FH. Integrate the fields into a display
using the following rule:
IF Family_ID ≠ 0FH
THEN DisplayFamily = Family_ID;
ELSE DisplayFamily = Extended_Family_ID + Family_ID;
(* Right justify and zero-extend 4-bit field. *)
FI;
(* Show DisplayFamily as HEX field. *)
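The family-integration rule above translates directly into code (a minimal illustration, not Intel-supplied software):

```python
def display_family(family_id, extended_family_id):
    """Integrate Family ID with Extended Family ID per the rule above:
    the extended field matters only when Family ID = 0FH."""
    if family_id != 0x0F:
        return family_id
    return extended_family_id + family_id

assert display_family(0x6, 0x0) == 0x06   # Family 06H processors
assert display_family(0xF, 0x1) == 0x10   # Extended family 01H → DisplayFamily 10H
```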
The Extended Model ID needs to be examined only when the Family ID is 06H or 0FH. Integrate the field into a
display using the following rule:
IF (Family_ID = 06H or Family_ID = 0FH)
THEN DisplayModel = (Extended_Model_ID « 4) + Model_ID;
(* Use Extended_Model_ID as the high nibble of the two-nibble DisplayModel value. *)
ELSE DisplayModel = Model_ID;
FI;
(* Show DisplayModel as HEX field. *)
NOTE
Software must confirm that a processor feature is present using feature flags returned by CPUID
prior to using the feature. Software should not depend on future offerings retaining all features.
Figure 3-7. Feature Information Returned in the ECX Register (OM16524b):
Bit 00: SSE3 — SSE3 Extensions.
Bit 01: PCLMULQDQ — Carryless Multiplication.
Bit 02: DTES64 — 64-bit DS Area.
Bit 03: MONITOR — MONITOR/MWAIT.
Bit 04: DS-CPL — CPL Qualified Debug Store.
Bit 05: VMX — Virtual Machine Extensions.
Bit 06: SMX — Safer Mode Extensions.
Bit 07: EIST — Enhanced Intel SpeedStep® Technology.
Bit 08: TM2 — Thermal Monitor 2.
Bit 09: SSSE3 — SSSE3 Extensions.
Bit 10: CNXT-ID — L1 Context ID.
Bit 11: SDBG.
Bit 12: FMA — Fused Multiply Add.
Bit 13: CMPXCHG16B.
Bit 14: xTPR Update Control.
Bit 15: PDCM — Perf/Debug Capability MSR.
Bit 16: Reserved.
Bit 17: PCID — Process-context Identifiers.
Bit 18: DCA — Direct Cache Access.
Bit 19: SSE4_1 — SSE4.1.
Bit 20: SSE4_2 — SSE4.2.
Bit 21: x2APIC.
Bit 22: MOVBE.
Bit 23: POPCNT.
Bit 24: TSC-Deadline.
Bit 25: AES.
Bit 26: XSAVE.
Bit 27: OSXSAVE.
Bit 28: AVX.
Bit 29: F16C.
Bit 30: RDRAND.
Bit 31: Reserved (always returns 0).
Figure 3-8. Feature Information Returned in the EDX Register (OM16523). Individual bit assignments are
described in Table 3-11.
Table 3-11. More on Feature Information Returned in the EDX Register (Contd.)
Bit # Mnemonic Description
21 DS Debug Store. The processor supports the ability to write debug information into a memory resident buffer.
This feature is used by the branch trace store (BTS) and processor event-based sampling (PEBS) facilities (see
Chapter 23, “Introduction to Virtual-Machine Extensions,” in the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3C).
22 ACPI Thermal Monitor and Software Controlled Clock Facilities. The processor implements internal MSRs that
allow processor temperature to be monitored and processor performance to be modulated in predefined duty
cycles under software control.
23 MMX Intel MMX Technology. The processor supports the Intel MMX technology.
24 FXSR FXSAVE and FXRSTOR Instructions. The FXSAVE and FXRSTOR instructions are supported for fast save and
restore of the floating point context. Presence of this bit also indicates that CR4.OSFXSR is available for an
operating system to indicate that it supports the FXSAVE and FXRSTOR instructions.
25 SSE SSE. The processor supports the SSE extensions.
26 SSE2 SSE2. The processor supports the SSE2 extensions.
27 SS Self Snoop. The processor supports the management of conflicting memory types by performing a snoop of its
own cache structure for transactions issued to the bus.
28 HTT Max APIC IDs reserved field is Valid. A value of 0 for HTT indicates there is only a single logical processor in
the package and software should assume only a single APIC ID is reserved. A value of 1 for HTT indicates the
value in CPUID.1.EBX[23:16] (the Maximum number of addressable IDs for logical processors in this package) is
valid for the package.
29 TM Thermal Monitor. The processor implements the thermal monitor automatic thermal control circuitry (TCC).
30 Reserved Reserved
31 PBE Pending Break Enable. The processor supports the use of the FERR#/PBE# pin when the processor is in the
stop-clock state (STPCLK# is asserted) to signal the processor that an interrupt is pending and that the
processor should return to normal operation to handle the interrupt. Bit 10 (PBE enable) in the
IA32_MISC_ENABLE MSR enables this capability.
INPUT EAX = 02H: TLB/Cache/Prefetch Information Returned in EAX, EBX, ECX, EDX
When CPUID executes with EAX set to 02H, the processor returns information about the processor’s internal TLBs,
cache and prefetch hardware in the EAX, EBX, ECX, and EDX registers. The information is reported in encoded form
and falls into the following categories:
• The least-significant byte in register EAX (register AL) will always return 01H. Software should ignore this value
and not interpret it as an informational descriptor.
• The most significant bit (bit 31) of each register indicates whether the register contains valid information (set
to 0) or is reserved (set to 1).
• If a register contains valid information, the information is contained in 1-byte descriptors. There are four types
of encoding values for the byte descriptor; the encoding type is noted in the second column of Table 3-12,
which lists the encoding of these descriptors. Note that the order of descriptors in the EAX, EBX, ECX, and EDX
registers is not defined; that is, specific bytes are not designated to contain descriptors for specific cache,
prefetch, or TLB types. The descriptors may appear in any order. Note also that a processor may report a general
descriptor type (FFH) and not report any byte descriptor of “cache type” via CPUID leaf 2.
For example, the first member of the Pentium 4 processor family returns the following register values:
EAX 66 5B 50 01H
EBX 0H
ECX 0H
EDX 00 7A 70 00H
Which means:
• The least-significant byte (byte 0) of register EAX is set to 01H. This value should be ignored.
• The most-significant bit of all four registers (EAX, EBX, ECX, and EDX) is set to 0, indicating that each register
contains valid 1-byte descriptors.
• Bytes 1, 2, and 3 of register EAX indicate that the processor has:
— 50H - a 64-entry instruction TLB, for mapping 4-KByte and 2-MByte or 4-MByte pages.
— 5BH - a 64-entry data TLB, for mapping 4-KByte and 4-MByte pages.
— 66H - an 8-KByte 1st level data cache, 4-way set associative, with a 64-Byte cache line size.
• The descriptors in registers EBX and ECX are valid, but contain NULL descriptors.
• Bytes 0, 1, 2, and 3 of register EDX indicate that the processor has:
— 00H - NULL descriptor.
— 70H - Trace cache: 12 K-μop, 8-way set associative.
— 7AH - a 256-KByte 2nd level cache, 8-way set associative, with a sectored, 64-byte cache line size.
— 00H - NULL descriptor.
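The decoding steps above can be sketched as follows (an illustrative helper, not Intel-supplied code), using the example register values just shown:

```python
def leaf2_descriptors(eax, ebx, ecx, edx):
    """Collect the valid 1-byte descriptors from CPUID leaf 02H output.
    Byte 0 of EAX (always 01H) is skipped, a register with bit 31 set
    is reserved, and 00H bytes are NULL descriptors."""
    out = []
    for reg, skip_low in ((eax, True), (ebx, False), (ecx, False), (edx, False)):
        if reg & (1 << 31):          # bit 31 set: register is reserved
            continue
        for i in range(1 if skip_low else 0, 4):
            byte = (reg >> (8 * i)) & 0xFF
            if byte != 0:            # drop NULL descriptors
                out.append(byte)
    return sorted(out)

print([hex(d) for d in leaf2_descriptors(0x665B5001, 0, 0, 0x007A7000)])
# → ['0x50', '0x5b', '0x66', '0x70', '0x7a']
```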
INPUT EAX = 04H: Returns Deterministic Cache Parameters for Each Level
When CPUID executes with EAX set to 04H and ECX contains an index value, the processor returns encoded data
that describe a set of deterministic cache parameters (for the cache level associated with the input in ECX). Valid
index values start from 0.
Software can enumerate the deterministic cache parameters for each level of the cache hierarchy starting with an
index value of 0, until the parameters report a cache type field of 0. The architecturally defined fields reported by
deterministic cache parameters are documented in Table 3-8.
This Cache Size in Bytes
= (Ways + 1) * (Partitions + 1) * (Line_Size + 1) * (Sets + 1)
= (EBX[31:22] + 1) * (EBX[21:12] + 1) * (EBX[11:0] + 1) * (ECX + 1)
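The formula translates directly into code; the sample register values below are hypothetical (an 8-way cache with 1 partition, 64-byte lines, and 64 sets):

```python
def cache_size_bytes(ebx, ecx):
    """Compute cache size from CPUID leaf 04H output:
    (Ways + 1) * (Partitions + 1) * (Line_Size + 1) * (Sets + 1)."""
    ways = (ebx >> 22) & 0x3FF        # EBX[31:22]
    partitions = (ebx >> 12) & 0x3FF  # EBX[21:12]
    line_size = ebx & 0xFFF           # EBX[11:0]
    sets = ecx                        # ECX[31:0]
    return (ways + 1) * (partitions + 1) * (line_size + 1) * (sets + 1)

print(cache_size_bytes((7 << 22) | (0 << 12) | 63, 63))  # → 32768 (a 32-KByte cache)
```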
The CPUID leaf 04H also reports data that can be used to derive the topology of processor cores in a physical
package. This information is constant for all valid index values. Software can query the raw data reported by
executing CPUID with EAX=04H and ECX=0 and use it as part of the topology enumeration algorithm described in
Chapter 8, “Multiple-Processor Management,” in the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 3A.
INPUT EAX = 0FH: Returns Intel Resource Director Technology (Intel RDT) Monitoring Enumeration Information
When CPUID executes with EAX set to 0FH and ECX = 0, the processor returns information about the bit-vector
representation of QoS monitoring resource types that are supported in the processor and the maximum range of
RMID values the processor can use to monitor any of the supported resource types. Each bit, starting from bit 1,
corresponds to a specific resource type if the bit is set. The bit position corresponds to the sub-leaf index (or
ResID) that software must use to query the QoS monitoring capability available for that type. See Table 3-8.
When CPUID executes with EAX set to 0FH and ECX = n (n >= 1, and is a valid ResID), the processor returns infor-
mation software can use to program IA32_PQR_ASSOC, IA32_QM_EVTSEL MSRs before reading QoS data from the
IA32_QM_CTR MSR.
INPUT EAX = 10H: Returns Intel Resource Director Technology (Intel RDT) Allocation Enumeration Information
When CPUID executes with EAX set to 10H and ECX = 0, the processor returns information about the bit-vector
representation of QoS Enforcement resource types that are supported in the processor. Each bit, starting from bit
1, corresponds to a specific resource type if the bit is set. The bit position corresponds to the sub-leaf index (or
ResID) that software must use to query QoS enforcement capability available for that type. See Table 3-8.
When CPUID executes with EAX set to 10H and ECX = n (n >= 1, and is a valid ResID), the processor returns infor-
mation about available classes of service and range of QoS mask MSRs that software can use to configure each
class of services using capability bit masks in the QoS Mask registers, IA32_resourceType_Mask_n.
INPUT EAX = 15H: Returns Time Stamp Counter and Nominal Core Crystal Clock Information
When CPUID executes with EAX set to 15H and ECX = 0H, the processor returns information about Time Stamp
Counter and Core Crystal Clock. See Table 3-8.
[Flowchart OM15194: Determination of support for the extended (brand string) functions: execute CPUID with an
input of EAX = 80000000H and test whether the maximum extended function value returned in EAX covers the
brand string leaves.]
Table 3-13 shows the brand string that is returned by the first processor in the Pentium 4 processor family.
[Flowchart OM15195: Algorithm for extracting the processor base frequency from the brand string:
1. Scan the brand string in reverse for a match of the substring "zHM", "zHG", or "zHT"; if no substring matches,
report an error.
2. Determine "Multiplier": if "zHG", Multiplier = 1 x 10^9; if "zHT", Multiplier = 1 x 10^12 (if "zHM",
Multiplier = 1 x 10^6).
3. Scan digits until a blank, in reverse order; if the digits read "ZY.X", reversing them gives the decimal value
"Freq" = X.YZ.
4. Processor Base Frequency = "Freq" x "Multiplier".]
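A hedged sketch of that extraction, using a forward regular-expression scan rather than the literal reverse scan of the flowchart (the brand string below is an example value):

```python
import re

def brand_string_frequency_hz(brand):
    """Extract the processor base frequency encoded at the end of the
    CPUID brand string, e.g. '... 1.50GHz' -> 1.5e9."""
    m = re.search(r'([0-9.]+)\s*([MGT])Hz', brand)
    if m is None:
        raise ValueError("no frequency substring in brand string")
    multiplier = {'M': 10**6, 'G': 10**9, 'T': 10**12}[m.group(2)]
    return float(m.group(1)) * multiplier

print(brand_string_frequency_hz("Intel(R) Pentium(R) 4 CPU 1.50GHz"))  # → 1500000000.0
```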
Operation
IA32_BIOS_SIGN_ID MSR ← Update with installed microcode revision number;
CASE (EAX) OF
EAX = 0:
EAX ← Highest basic function input value understood by CPUID;
EBX ← Vendor identification string;
EDX ← Vendor identification string;
ECX ← Vendor identification string;
BREAK;
EAX = 1H:
EAX[3:0] ← Stepping ID;
EAX[7:4] ← Model;
EAX[11:8] ← Family;
EAX[13:12] ← Processor type;
EAX[15:14] ← Reserved;
EAX[19:16] ← Extended Model;
EAX[27:20] ← Extended Family;
EAX[31:28] ← Reserved;
EBX[7:0] ← Brand Index; (* Reserved if the value is zero. *)
EBX[15:8] ← CLFLUSH Line Size;
EBX[23:16] ← Reserved; (* Number of threads enabled = 2 if MT enable fuse set. *)
EBX[31:24] ← Initial APIC ID;
ECX ← Feature flags; (* See Figure 3-7. *)
EDX ← Feature flags; (* See Figure 3-8. *)
BREAK;
EAX = 2H:
EAX ← Cache and TLB information;
EBX ← Cache and TLB information;
ECX ← Cache and TLB information;
EDX ← Cache and TLB information;
BREAK;
EAX = 3H:
EAX ← Reserved;
EBX ← Reserved;
ECX ← ProcessorSerialNumber[31:0];
(* Pentium III processors only, otherwise reserved. *)
EDX ← ProcessorSerialNumber[63:32];
(* Pentium III processors only, otherwise reserved. *)
BREAK;
EAX = 4H:
EAX ← Deterministic Cache Parameters Leaf; (* See Table 3-8. *)
EBX ← Deterministic Cache Parameters Leaf;
ECX ← Deterministic Cache Parameters Leaf;
EDX ← Deterministic Cache Parameters Leaf;
BREAK;
EAX = 5H:
EAX ← MONITOR/MWAIT Leaf; (* See Table 3-8. *)
EBX ← MONITOR/MWAIT Leaf;
ECX ← MONITOR/MWAIT Leaf;
EDX ← MONITOR/MWAIT Leaf;
BREAK;
EAX = 6H:
EAX ← Thermal and Power Management Leaf; (* See Table 3-8. *)
EBX ← Thermal and Power Management Leaf;
ECX ← Thermal and Power Management Leaf;
EDX ← Thermal and Power Management Leaf;
BREAK;
EAX = 7H:
EAX ← Structured Extended Feature Flags Enumeration Leaf; (* See Table 3-8. *)
EBX ← Structured Extended Feature Flags Enumeration Leaf;
ECX ← Structured Extended Feature Flags Enumeration Leaf;
EDX ← Structured Extended Feature Flags Enumeration Leaf;
BREAK;
EAX = 8H:
EAX ← Reserved = 0;
EBX ← Reserved = 0;
ECX ← Reserved = 0;
EDX ← Reserved = 0;
BREAK;
EAX = 9H:
EAX ← Direct Cache Access Information Leaf; (* See Table 3-8. *)
EBX ← Direct Cache Access Information Leaf;
ECX ← Direct Cache Access Information Leaf;
EDX ← Direct Cache Access Information Leaf;
BREAK;
EAX = AH:
EAX ← Architectural Performance Monitoring Leaf; (* See Table 3-8. *)
EBX ← Architectural Performance Monitoring Leaf;
ECX ← Architectural Performance Monitoring Leaf;
EDX ← Architectural Performance Monitoring Leaf;
BREAK
EAX = BH:
EAX ← Extended Topology Enumeration Leaf; (* See Table 3-8. *)
EBX ← Extended Topology Enumeration Leaf;
ECX ← Extended Topology Enumeration Leaf;
EDX ← Extended Topology Enumeration Leaf;
BREAK;
EAX = CH:
EAX ← Reserved = 0;
EBX ← Reserved = 0;
ECX ← Reserved = 0;
EDX ← Reserved = 0;
BREAK;
EAX = DH:
EAX ← Processor Extended State Enumeration Leaf; (* See Table 3-8. *)
EBX ← Processor Extended State Enumeration Leaf;
ECX ← Processor Extended State Enumeration Leaf;
EDX ← Processor Extended State Enumeration Leaf;
BREAK;
EAX = EH:
EAX ← Reserved = 0;
EBX ← Reserved = 0;
ECX ← Reserved = 0;
EDX ← Reserved = 0;
BREAK;
EAX = FH:
EAX ← Intel Resource Director Technology Monitoring Enumeration Leaf; (* See Table 3-8. *)
EBX ← Intel Resource Director Technology Monitoring Enumeration Leaf;
ECX ← Intel Resource Director Technology Monitoring Enumeration Leaf;
EDX ← Intel Resource Director Technology Monitoring Enumeration Leaf;
BREAK;
EAX = 10H:
EAX ← Intel Resource Director Technology Allocation Enumeration Leaf; (* See Table 3-8. *)
EBX ← Intel Resource Director Technology Allocation Enumeration Leaf;
ECX ← Intel Resource Director Technology Allocation Enumeration Leaf;
EDX ← Intel Resource Director Technology Allocation Enumeration Leaf;
BREAK;
EAX = 12H:
EAX ← Intel SGX Enumeration Leaf; (* See Table 3-8. *)
EBX ← Intel SGX Enumeration Leaf;
ECX ← Intel SGX Enumeration Leaf;
EDX ← Intel SGX Enumeration Leaf;
BREAK;
EAX = 14H:
EAX ← Intel Processor Trace Enumeration Leaf; (* See Table 3-8. *)
EBX ← Intel Processor Trace Enumeration Leaf;
ECX ← Intel Processor Trace Enumeration Leaf;
EDX ← Intel Processor Trace Enumeration Leaf;
BREAK;
EAX = 15H:
EAX ← Time Stamp Counter and Nominal Core Crystal Clock Information Leaf; (* See Table 3-8. *)
EBX ← Time Stamp Counter and Nominal Core Crystal Clock Information Leaf;
ECX ← Time Stamp Counter and Nominal Core Crystal Clock Information Leaf;
EDX ← Time Stamp Counter and Nominal Core Crystal Clock Information Leaf;
BREAK;
EAX = 16H:
EAX ← Processor Frequency Information Enumeration Leaf; (* See Table 3-8. *)
EBX ← Processor Frequency Information Enumeration Leaf;
ECX ← Processor Frequency Information Enumeration Leaf;
EDX ← Processor Frequency Information Enumeration Leaf;
BREAK;
EAX = 17H:
EAX ← System-On-Chip Vendor Attribute Enumeration Leaf; (* See Table 3-8. *)
EBX ← System-On-Chip Vendor Attribute Enumeration Leaf;
ECX ← System-On-Chip Vendor Attribute Enumeration Leaf;
EDX ← System-On-Chip Vendor Attribute Enumeration Leaf;
BREAK;
EAX = 18H:
EAX ← Deterministic Address Translation Parameters Enumeration Leaf; (* See Table 3-8. *)
EBX ← Deterministic Address Translation Parameters Enumeration Leaf;
ECX ←Deterministic Address Translation Parameters Enumeration Leaf;
EDX ← Deterministic Address Translation Parameters Enumeration Leaf;
BREAK;
EAX = 1AH:
EAX ← Hybrid Information Enumeration Leaf; (* See Table 3-8. *)
EBX ← Hybrid Information Enumeration Leaf;
ECX ← Hybrid Information Enumeration Leaf;
EDX ← Hybrid Information Enumeration Leaf;
BREAK;
EAX = 1FH:
EAX ← V2 Extended Topology Enumeration Leaf; (* See Table 3-8. *)
EBX ← V2 Extended Topology Enumeration Leaf;
ECX ← V2 Extended Topology Enumeration Leaf;
EDX ← V2 Extended Topology Enumeration Leaf;
BREAK;
EAX = 80000000H:
EAX ← Highest extended function input value understood by CPUID;
EBX ← Reserved;
ECX ← Reserved;
EDX ← Reserved;
BREAK;
EAX = 80000001H:
EAX ← Reserved;
EBX ← Reserved;
ECX ← Extended Feature Bits (* See Table 3-8.*);
EDX ← Extended Feature Bits (* See Table 3-8. *);
BREAK;
EAX = 80000002H:
EAX ← Processor Brand String;
EBX ← Processor Brand String, continued;
ECX ← Processor Brand String, continued;
EDX ← Processor Brand String, continued;
BREAK;
EAX = 80000003H:
EAX ← Processor Brand String, continued;
EBX ← Processor Brand String, continued;
ECX ← Processor Brand String, continued;
EDX ← Processor Brand String, continued;
BREAK;
EAX = 80000004H:
EAX ← Processor Brand String, continued;
EBX ← Processor Brand String, continued;
ECX ← Processor Brand String, continued;
EDX ← Processor Brand String, continued;
BREAK;
EAX = 80000005H:
EAX ← Reserved = 0;
EBX ← Reserved = 0;
ECX ← Reserved = 0;
EDX ← Reserved = 0;
BREAK;
EAX = 80000006H:
EAX ← Reserved = 0;
EBX ← Reserved = 0;
ECX ← Cache information;
EDX ← Reserved = 0;
BREAK;
EAX = 80000007H:
EAX ← Reserved = 0;
EBX ← Reserved = 0;
ECX ← Reserved = 0;
EDX ← Misc Feature Flags;
BREAK;
EAX = 80000008H:
EAX ← Physical Address Size Information;
EBX ← Virtual Address Size Information;
ECX ← Reserved = 0;
EDX ← Reserved = 0;
BREAK;
EAX >= 40000000H and EAX <= 4FFFFFFFH:
DEFAULT: (* EAX = Value outside of recognized range for CPUID. *)
(* If the highest basic information leaf data depend on ECX input value, ECX is honored.*)
EAX ← Reserved; (* Information returned for highest basic information leaf. *)
EBX ← Reserved; (* Information returned for highest basic information leaf. *)
ECX ← Reserved; (* Information returned for highest basic information leaf. *)
EDX ← Reserved; (* Information returned for highest basic information leaf. *)
BREAK;
ESAC;
Flags Affected
None.
Description
Terminates an indirect branch in 32-bit and compatibility mode.
Operation
IF EndbranchEnabled(CPL) & (EFER.LMA = 0 | (EFER.LMA = 1 & CS.L = 0))
IF CPL = 3
THEN
IA32_U_CET.TRACKER = IDLE
IA32_U_CET.SUPPRESS = 0
ELSE
IA32_S_CET.TRACKER = IDLE
IA32_S_CET.SUPPRESS = 0
FI;
FI;
Flags Affected
None.
Exceptions
None.
Description
Terminates an indirect branch in 64-bit mode.
Operation
IF EndbranchEnabled(CPL) & EFER.LMA = 1 & CS.L = 1
IF CPL = 3
THEN
IA32_U_CET.TRACKER = IDLE
IA32_U_CET.SUPPRESS = 0
ELSE
IA32_S_CET.TRACKER = IDLE
IA32_S_CET.SUPPRESS = 0
FI;
FI;
Flags Affected
None.
Exceptions
None.
Description
Pushes the source operand onto the FPU register stack. The source operand can be in single-precision, double-
precision, or double extended-precision floating-point format. If the source operand is in single-precision or
double-precision floating-point format, it is automatically converted to the double extended-precision floating-
point format before being pushed on the stack.
The FLD instruction can also push the value in a selected FPU register [ST(i)] onto the stack. Here, pushing register
ST(0) duplicates the stack top.
NOTE
When the FLD instruction loads a denormal value and the denormal-operand exception is not masked
(the DM bit in the x87 FPU control word is clear), an exception is flagged but the value is still pushed
onto the x87 stack.
This instruction’s operation is the same in non-64-bit modes and 64-bit mode.
Operation
IF SRC is ST(i)
THEN
temp ← ST(i);
FI;
TOP ← TOP − 1;
IF SRC is memory-operand
THEN
ST(0) ← ConvertToDoubleExtendedPrecisionFP(SRC);
ELSE (* SRC is ST(i) *)
ST(0) ← temp;
FI;
Floating-Point Exceptions
#IS Stack underflow or overflow occurred.
#IA Source operand is an SNaN. Does not occur if the source operand is in double extended-preci-
sion floating-point format (FLD m80fp or FLD ST(i)).
#D Source operand is a denormal value. Does not occur if the source operand is in double
extended-precision floating-point format.
Description
The AFFINEINVB instruction computes an affine transformation in the Galois field GF(2^8). For this instruction, an affine
transformation is defined by A * inv(x) + b, where “A” is an 8 by 8 bit matrix, and “x” and “b” are 8-bit vectors. The
inverse of the bytes in x is defined with respect to the reduction polynomial x^8 + x^4 + x^3 + x + 1.
One SIMD register (operand 1) holds “x” as either 16, 32 or 64 8-bit vectors. A second SIMD (operand 2) register
or memory operand contains 2, 4, or 8 “A” values, which are operated upon by the correspondingly aligned 8 “x”
values in the first register. The “b” vector is constant for all calculations and contained in the immediate byte.
The EVEX encoded form of this instruction does not support memory fault suppression. The SSE encoded forms of
the instruction require 16B alignment on their memory operations.
The inverse of each byte is given by the following table. The upper nibble is on the vertical axis and the lower nibble
is on the horizontal axis. For example, the inverse of 0x95 is 0x8A.
Operation
define affine_inverse_byte(tsrc2qw, src1byte, imm):
FOR i ← 0 to 7:
* parity(x) = 1 if x has an odd number of 1s in it, and 0 otherwise.*
* inverse(x) is defined in the table above *
retbyte.bit[i] ← parity(tsrc2qw.byte[7-i] AND inverse(src1byte)) XOR imm8.bit[i]
return retbyte
FOR b ← 0 to 7:
IF k1[j*8+b] OR *no writemask*:
FOR i ← 0 to 7:
DEST.qword[j].byte[b] ← affine_inverse_byte(tsrc2, SRC1.qword[j].byte[b], imm8)
ELSE IF *zeroing*:
DEST.qword[j].byte[b] ← 0
*ELSE DEST.qword[j].byte[b] remains unchanged*
DEST[MAX_VL-1:VL] ← 0
VGF2P8AFFINEINVQB dest, src1, src2, imm8 (128b and 256b VEX encoded versions)
(KL, VL) = (2, 128), (4, 256)
FOR j ← 0 TO KL-1:
FOR b ← 0 to 7:
DEST.qword[j].byte[b] ← affine_inverse_byte(SRC2.qword[j], SRC1.qword[j].byte[b], imm8)
DEST[MAX_VL-1:VL] ← 0
Other Exceptions
Legacy-encoded and VEX-encoded: Exceptions Type 4.
EVEX-encoded: See Exceptions Type E4NF.
Description
The AFFINEB instruction computes an affine transformation in the Galois field GF(2^8). For this instruction, an affine
transformation is defined by A * x + b, where “A” is an 8 by 8 bit matrix, and “x” and “b” are 8-bit vectors. One SIMD
register (operand 1) holds “x” as either 16, 32 or 64 8-bit vectors. A second SIMD (operand 2) register or memory
operand contains 2, 4, or 8 “A” values, which are operated upon by the correspondingly aligned 8 “x” values in the
first register. The “b” vector is constant for all calculations and contained in the immediate byte.
The EVEX encoded form of this instruction does not support memory fault suppression. The SSE encoded forms of
the instruction require 16B alignment on their memory operations.
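The bit-matrix transform described above can be modeled in Python (an illustrative sketch, not Intel-supplied code; the identity-matrix qword below is a constructed example value):

```python
def parity(x):
    """Return 1 if x has an odd number of set bits, else 0."""
    p = 0
    while x:
        p ^= x & 1
        x >>= 1
    return p

def affine_byte(a_qword, src_byte, imm8):
    """A * x + b over GF(2): bit i of the result is the parity of
    (matrix row byte[7-i] AND x), XORed with bit i of the immediate."""
    result = 0
    for i in range(8):
        row = (a_qword >> (8 * (7 - i))) & 0xFF   # tsrc2qw.byte[7-i]
        bit = parity(row & src_byte) ^ ((imm8 >> i) & 1)
        result |= bit << i
    return result

# Identity matrix (byte[7-i] = 1 << i) with b = 0 returns x unchanged.
IDENTITY = 0x0102040810204080
print(hex(affine_byte(IDENTITY, 0xAB, 0)))  # → 0xab
```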
Operation
define parity(x):
t←0 // single bit
FOR i ← 0 to 7:
t = t xor x.bit[i]
return t
FOR b ← 0 to 7:
IF k1[j*8+b] OR *no writemask*:
DEST.qword[j].byte[b] ← affine_byte(tsrc2, SRC1.qword[j].byte[b], imm8)
ELSE IF *zeroing*:
DEST.qword[j].byte[b] ← 0
*ELSE DEST.qword[j].byte[b] remains unchanged*
DEST[MAX_VL-1:VL] ← 0
VGF2P8AFFINEQB dest, src1, src2, imm8 (128b and 256b VEX encoded versions)
(KL, VL) = (2, 128), (4, 256)
FOR j ← 0 TO KL-1:
FOR b ← 0 to 7:
DEST.qword[j].byte[b] ← affine_byte(SRC2.qword[j], SRC1.qword[j].byte[b], imm8)
DEST[MAX_VL-1:VL] ← 0
Other Exceptions
Legacy-encoded and VEX-encoded: Exceptions Type 4.
EVEX-encoded: See Exceptions Type E4NF.
Description
The instruction multiplies elements in the finite field GF(2^8), operating on a byte (field element) in the first source
operand and the corresponding byte in a second source operand. The field GF(2^8) is represented in polynomial
representation with the reduction polynomial x^8 + x^4 + x^3 + x + 1.
This instruction does not support broadcasting.
The EVEX encoded form of this instruction supports memory fault suppression. The SSE encoded forms of the
instruction require 16B alignment on their memory operations.
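The field multiplication in the Operation section below can be modeled directly. Since the inverse of 0x95 in this field is 0x8A (per the inverse table noted earlier), their product must be 1 (an illustrative sketch, not Intel-supplied code):

```python
def gf2p8_mul(a, b):
    """Multiply two elements of GF(2^8), reduced by the polynomial
    x^8 + x^4 + x^3 + x + 1 (0x11B), mirroring gf2p8mul_byte."""
    t = 0
    for i in range(8):                 # carry-less multiply
        if (b >> i) & 1:
            t ^= a << i
    for i in range(14, 7, -1):         # reduce bits 14..8 by 0x11B
        if (t >> i) & 1:
            t ^= 0x11B << (i - 8)
    return t & 0xFF

print(hex(gf2p8_mul(0x95, 0x8A)))  # → 0x1 (0x8A is the inverse of 0x95)
```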
Operation
define gf2p8mul_byte(src1byte, src2byte):
tword ← 0
FOR i ← 0 to 7:
IF src2byte.bit[i]:
tword ← tword XOR (src1byte << i)
* carry out polynomial reduction by the characteristic polynomial p*
FOR i ← 14 downto 8:
p ← 0x11B << (i-8) *0x11B = 0000_0001_0001_1011 in binary*
IF tword.bit[i]:
tword ← tword XOR p
return tword.byte[0]
VGF2P8MULB dest, src1, src2 (128b and 256b VEX encoded versions)
(KL, VL) = (16, 128), (32, 256)
FOR j ← 0 TO KL-1:
DEST.byte[j] ← gf2p8mul_byte(SRC1.byte[j], SRC2.byte[j])
DEST[MAX_VL-1:VL] ← 0
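The gf2p8mul_byte pseudocode translates directly to software. The sketch below (illustrative, not normative) performs a carry-less multiply of the two bytes, then reduces the 15-bit product modulo x^8 + x^4 + x^3 + x + 1 (0x11B):

```python
def gf2p8mul_byte(src1, src2):
    # carry-less (polynomial) multiply of two bytes
    t = 0
    for i in range(8):
        if (src2 >> i) & 1:
            t ^= src1 << i
    # reduce modulo the characteristic polynomial 0x11B
    for i in range(14, 7, -1):
        if (t >> i) & 1:
            t ^= 0x11B << (i - 8)
    return t & 0xFF
```

For example, {57} x {83} = {C1} in this field; this is the worked multiplication example from FIPS-197, which uses the same reduction polynomial for AES.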
Other Exceptions
Legacy-encoded and VEX-encoded: Exceptions Type 4.
EVEX-encoded: See Exceptions Type E4.
INC—Increment by 1
Opcode          Instruction   Op/En  64-Bit Mode  Compat/Leg Mode  Description
FE /0           INC r/m8      M      Valid        Valid            Increment r/m byte by 1.
REX + FE /0     INC r/m8*     M      Valid        N.E.             Increment r/m byte by 1.
FF /0           INC r/m16     M      Valid        Valid            Increment r/m word by 1.
FF /0           INC r/m32     M      Valid        Valid            Increment r/m doubleword by 1.
REX.W + FF /0   INC r/m64     M      Valid        N.E.             Increment r/m quadword by 1.
40+ rw**        INC r16       O      N.E.         Valid            Increment word register by 1.
40+ rd**        INC r32       O      N.E.         Valid            Increment doubleword register by 1.
NOTES:
* In 64-bit mode, r/m8 cannot be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.
** 40H through 47H are REX prefixes in 64-bit mode.
Description
Adds 1 to the destination operand, while preserving the state of the CF flag. The destination operand can be a
register or a memory location. This instruction allows a loop counter to be updated without disturbing the CF flag.
(Use an ADD instruction with an immediate operand of 1 to perform an increment operation that does update the
CF flag.)
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.
In 64-bit mode, INC r16 and INC r32 are not encodable (because opcodes 40H through 47H are REX prefixes).
Otherwise, the instruction’s 64-bit mode default operation size is 32 bits. Use of the REX.R prefix permits access to
additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits.
Operation
DEST ← DEST + 1;
Flags Affected
The CF flag is not affected. The OF, SF, ZF, AF, and PF flags are set according to the result.
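The difference from ADD is confined to the carry flag. A small Python model (hypothetical helper names, 8-bit operands for brevity) illustrates why INC is preferred for updating loop counters inside carry-chain arithmetic:

```python
def inc8(val, cf):
    # INC: CF is preserved; the other arithmetic flags reflect the result
    res = (val + 1) & 0xFF
    flags = {"CF": cf,                  # unchanged
             "ZF": res == 0,
             "SF": bool(res & 0x80),
             "OF": val == 0x7F}         # 0x7F -> 0x80 overflows the signed range
    return res, flags

def add8(val, imm, cf):
    # ADD: CF is rewritten with the unsigned carry out of bit 7
    res = (val + imm) & 0xFF
    flags = {"CF": (val + imm) > 0xFF,
             "ZF": res == 0,
             "SF": bool(res & 0x80),
             "OF": ((val ^ res) & (imm ^ res) & 0x80) != 0}
    return res, flags
```

Incrementing 0xFF wraps to 0 in both cases, but only ADD overwrites a previously set CF.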
Description
This instruction can be used to increment the current shadow stack pointer by the operand size of the instruction
times the unsigned 8-bit value specified by bits 7:0 in the source operand. The instruction performs a pop and
discard of the first and last element on the shadow stack in the range specified by the unsigned 8-bit value in bits
7:0 of the source operand.
Operation
IF CPL = 3
IF (CR4.CET & IA32_U_CET.SH_STK_EN) = 0
THEN #UD; FI;
ELSE
IF (CR4.CET & IA32_S_CET.SH_STK_EN) = 0
THEN #UD; FI;
FI;
Flags Affected
None.
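The INCSSP pointer update described above reduces to the following sketch (illustrative names; the element size is 4 for INCSSPD and 8 for INCSSPQ):

```python
def incssp(ssp, elem_size, src):
    # elem_size: operand size of the instruction (4 or 8 bytes)
    count = src & 0xFF  # unsigned 8-bit value from bits 7:0 of the source
    # the real instruction also pops and discards the first and last
    # elements in the range, faulting if they are not accessible
    return ssp + elem_size * count
```

Note that only the low 8 bits of the source operand participate, so a source of 0x103 advances by 3 elements.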
Description
The INT n instruction generates a call to the interrupt or exception handler specified with the destination operand
(see the section titled “Interrupts and Exceptions” in Chapter 6 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1). The destination operand specifies a vector from 0 to 255, encoded as an 8-bit
unsigned immediate value. Each vector provides an index to a gate descriptor in the IDT. The first 32 vectors are
reserved by Intel for system use. Some of these vectors are used for internally generated exceptions.
The INT n instruction is the general mnemonic for executing a software-generated call to an interrupt handler. The
INTO instruction is a special mnemonic for calling overflow exception (#OF), exception 4. The overflow interrupt
checks the OF flag in the EFLAGS register and calls the overflow interrupt handler if the OF flag is set to 1. (The
INTO instruction cannot be used in 64-bit mode.)
The INT3 instruction uses a one-byte opcode (CC) and is intended for calling the debug exception handler with a
breakpoint exception (#BP). (This one-byte form is useful because it can replace the first byte of any instruction at
which a breakpoint is desired, including other one-byte instructions, without overwriting other instructions.)
The INT1 instruction also uses a one-byte opcode (F1) and generates a debug exception (#DB) without setting any
bits in DR6.1 Hardware vendors may use the INT1 instruction for hardware debug. For that reason, Intel recom-
mends software vendors instead use the INT3 instruction for software breakpoints.
An interrupt generated by the INTO, INT3, or INT1 instruction differs from one generated by INT n in the following
ways:
• The normal IOPL checks do not occur in virtual-8086 mode. The interrupt is taken (without fault) with any IOPL
value.
• The interrupt redirection enabled by the virtual-8086 mode extensions (VME) does not occur. The interrupt is
always handled by a protected-mode handler.
(These features do not pertain to CD03, the “normal” 2-byte opcode for INT 3. Intel and Microsoft assemblers will
not generate the CD03 opcode from any mnemonic, but this opcode can be created by direct numeric code defini-
tion or by self-modifying code.)
The action of the INT n instruction (including the INTO, INT3, and INT1 instructions) is similar to that of a far call
made with the CALL instruction. The primary difference is that with the INT n instruction, the EFLAGS register is
pushed onto the stack before the return address. (The return address is a far address consisting of the current
values of the CS and EIP registers.) Returns from interrupt procedures are handled with the IRET instruction, which
pops the EFLAGS information and return address from the stack.
Each of the INT n, INTO, and INT3 instructions generates a general-protection exception (#GP) if the CPL is greater
than the DPL value in the selected gate descriptor in the IDT. In contrast, the INT1 instruction can deliver a #DB
even if the CPL is greater than the DPL of descriptor 1 in the IDT. (This behavior supports the use of INT1 by
hardware vendors performing hardware debug.)
1. The mnemonic ICEBP has also been used for the instruction with opcode F1.
The vector specifies an interrupt descriptor in the interrupt descriptor table (IDT); that is, it provides an index into
the IDT. The selected interrupt descriptor in turn contains a pointer to an interrupt or exception handler procedure.
In protected mode, the IDT contains an array of 8-byte descriptors, each of which is an interrupt gate, trap gate,
or task gate. In real-address mode, the IDT is an array of 4-byte far pointers (a 2-byte code segment selector and
a 2-byte instruction pointer), each of which points directly to a procedure in the selected segment. (Note that in
real-address mode, the IDT is called the interrupt vector table, and its pointers are called interrupt vectors.)
The following decision table indicates which action in the lower portion of the table is taken given the conditions in
the upper portion of the table. Each Y in the lower section of the decision table represents a procedure defined in
the “Operation” section for this instruction (except #GP).
When the processor is executing in virtual-8086 mode, the IOPL determines the action of the INT n instruction. If
the IOPL is less than 3, the processor generates a #GP(selector) exception; if the IOPL is 3, the processor executes
a protected mode interrupt to privilege level 0. The interrupt gate's DPL must be set to 3 and the target CPL of the
interrupt handler procedure must be 0 to execute the protected mode interrupt to privilege level 0.
The interrupt descriptor table register (IDTR) specifies the base linear address and limit of the IDT. The initial base
address value of the IDTR after the processor is powered up or reset is 0.
Refer to Chapter 6, “Procedure Calls, Interrupts, and Exceptions” and Chapter 18, “Control-Flow Enforcement
Technology (CET)” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 for CET
details.
Instruction ordering. Instructions following an INT n may be fetched from memory before earlier instructions
complete execution, but they will not execute (even speculatively) until all instructions prior to the INT n have
completed execution (the later instructions may execute before data stored by the earlier instructions have become
globally visible). This applies also to the INTO, INT3, and INT1 instructions, but not to executions of INTO when
EFLAGS.OF = 0.
Operation
The following operational description applies not only to the INT n, INTO, INT3, or INT1 instructions, but also to
external interrupts, nonmaskable interrupts (NMIs), and exceptions. Some of these events push onto the stack an
error code.
The operational description specifies numerous checks whose failure may result in delivery of a nested exception.
In these cases, the original event is not delivered.
The operational description specifies the error code delivered by any nested exception. In some cases, the error
code is specified with a pseudofunction error_code(num,idt,ext), where idt and ext are bit values. The pseudofunc-
tion produces an error code as follows: (1) if idt is 0, the error code is (num & FCH) | ext; (2) if idt is 1, the error
code is (num « 3) | 2 | ext.
In many cases, the pseudofunction error_code is invoked with a pseudovariable EXT. The value of EXT depends on
the nature of the event whose delivery encountered a nested exception: if that event is a software interrupt (INT n,
INT3, or INTO), EXT is 0; otherwise (including INT1), EXT is 1.
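The error_code pseudofunction can be written out directly. In this sketch (illustrative; FCH is 0xFC, as given above), idt = 1 means num is an IDT vector and idt = 0 means num is a segment selector:

```python
def error_code(num, idt, ext):
    # idt = 1: vector-based error code; vector << 3, with the IDT flag (bit 1) set
    # idt = 0: selector-based error code; low two bits replaced by ext
    if idt:
        return (num << 3) | 2 | ext
    return (num & 0xFC) | ext
```

For example, a nested fault referencing vector 13 with EXT = 0 yields (13 << 3) | 2 = 6AH.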
IF PE = 0
THEN
GOTO REAL-ADDRESS-MODE;
ELSE (* PE = 1 *)
IF (EFLAGS.VM = 1 AND CR4.VME = 0 AND IOPL < 3 AND INT n)
THEN
#GP(0); (* Bit 0 of error code is 0 because INT n *)
ELSE
IF (EFLAGS.VM = 1 AND CR4.VME = 1 AND INT n)
THEN
Consult bit n of the software interrupt redirection bit map in the TSS;
IF bit n is clear
THEN (* redirect interrupt to 8086 program interrupt handler *)
Push EFLAGS[15:0]; (* if IOPL < 3, save VIF in IF position and save IOPL position as 3 *)
Push CS;
Push IP;
IF IOPL = 3
THEN IF ← 0; (* Clear interrupt flag *)
ELSE VIF ← 0; (* Clear virtual interrupt flag *)
FI;
TF ← 0; (* Clear trap flag *)
load CS and EIP (lower 16 bits only) from entry n in interrupt vector table referenced from TSS;
ELSE
IF IOPL = 3
THEN GOTO PROTECTED-MODE;
ELSE #GP(0); (* Bit 0 of error code is 0 because INT n *)
FI;
FI;
ELSE (* Protected mode, IA-32e mode, or virtual-8086 mode interrupt *)
IF (IA32_EFER.LMA = 0)
THEN (* Protected mode, or virtual-8086 mode interrupt *)
GOTO PROTECTED-MODE;
ELSE (* IA-32e mode interrupt *)
GOTO IA-32e-MODE;
FI;
FI;
FI;
FI;
REAL-ADDRESS-MODE:
IF ((vector_number « 2) + 3) is not within IDT limit
THEN #GP; FI;
IF stack not large enough for a 6-byte return information
THEN #SS; FI;
Push (EFLAGS[15:0]);
IF ← 0; (* Clear interrupt flag *)
TF ← 0; (* Clear trap flag *)
AC ← 0; (* Clear AC flag *)
Push(CS);
Push(IP);
(* No error codes are pushed in real-address mode*)
CS ← IDT(Descriptor (vector_number « 2), selector));
EIP ← IDT(Descriptor (vector_number « 2), offset)); (* 16 bit offset AND 0000FFFFH *)
END;
PROTECTED-MODE:
IF ((vector_number « 3) + 7) is not within IDT limits
or selected IDT descriptor is not an interrupt-, trap-, or task-gate type
THEN #GP(error_code(vector_number,1,EXT)); FI;
(* idt operand to error_code set because vector is used *)
IF software interrupt (* Generated by INT n, INT3, or INTO; does not apply to INT1 *)
THEN
IF gate DPL < CPL (* PE = 1, DPL < CPL, software interrupt *)
THEN #GP(error_code(vector_number,1,0)); FI;
(* idt operand to error_code set because vector is used *)
(* ext operand to error_code is 0 because INT n, INT3, or INTO*)
FI;
IF gate not present
THEN #NP(error_code(vector_number,1,EXT)); FI;
(* idt operand to error_code set because vector is used *)
IF task gate (* Specified in the selected interrupt table descriptor *)
THEN GOTO TASK-GATE;
ELSE GOTO TRAP-OR-INTERRUPT-GATE; (* PE = 1, trap/interrupt gate *)
FI;
END;
IA-32e-MODE:
IF INTO and CS.L = 1 (64-bit mode)
THEN #UD;
FI;
IF ((vector_number « 4) + 15) is not in IDT limits
or selected IDT descriptor is not an interrupt-, or trap-gate type
THEN #GP(error_code(vector_number,1,EXT));
(* idt operand to error_code set because vector is used *)
FI;
IF software interrupt (* Generated by INT n, INT3, or INTO; does not apply to INT1 *)
THEN
IF gate DPL < CPL (* PE = 1, DPL < CPL, software interrupt *)
THEN #GP(error_code(vector_number,1,0));
(* idt operand to error_code set because vector is used *)
(* ext operand to error_code is 0 because INT n, INT3, or INTO*)
FI;
FI;
IF gate not present
THEN #NP(error_code(vector_number,1,EXT));
(* idt operand to error_code set because vector is used *)
FI;
GOTO TRAP-OR-INTERRUPT-GATE; (* Trap/interrupt gate *)
END;
TASK-GATE: (* PE = 1, task gate *)
Read TSS selector in task gate (IDT descriptor);
IF local/global bit is set to local or index not within GDT limits
THEN #GP(error_code(TSS selector,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
Access TSS descriptor in GDT;
IF TSS descriptor specifies that the TSS is busy (low-order 5 bits set to 00001)
THEN #GP(error_code(TSS selector,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
IF TSS not present
THEN #NP(error_code(TSS selector,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
SWITCH-TASKS (with nesting) to TSS;
IF interrupt caused by fault with error code
THEN
IF stack limit does not allow push of error code
THEN #SS(EXT); FI;
Push(error code);
FI;
IF EIP not within code segment limit
THEN #GP(EXT); FI;
END;
TRAP-OR-INTERRUPT-GATE:
Read new code-segment selector for trap or interrupt gate (IDT descriptor);
IF new code-segment selector is NULL
THEN #GP(EXT); FI; (* Error code contains NULL selector *)
IF new code-segment selector is not within its descriptor table limits
THEN #GP(error_code(new code-segment selector,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
Read descriptor referenced by new code-segment selector;
IF descriptor does not indicate a code segment or new code-segment DPL > CPL
THEN #GP(error_code(new code-segment selector,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
IF new code-segment descriptor is not present,
THEN #NP(error_code(new code-segment selector,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
IF new code segment is non-conforming with DPL < CPL
THEN
IF VM = 0
THEN
GOTO INTER-PRIVILEGE-LEVEL-INTERRUPT;
(* PE = 1, VM = 0, interrupt or trap gate, nonconforming code segment,
DPL < CPL *)
ELSE (* VM = 1 *)
IF new code-segment DPL ≠ 0
THEN #GP(error_code(new code-segment selector,0,EXT));
IF IDT-gate IST = 0
THEN TSSstackAddress ← (new code-segment DPL « 3) + 4;
ELSE TSSstackAddress ← (IDT gate IST « 3) + 28;
FI;
IF (TSSstackAddress + 7) > current TSS limit
THEN #TS(error_code(current TSS selector,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
NewRSP ← 8 bytes loaded from (current TSS base + TSSstackAddress);
NewSS ← new code-segment DPL; (* NULL selector with RPL = new CPL *)
IF IDT-gate IST = 0
THEN
NewSSP ← IA32_PLi_SSP (* where i = new code-segment DPL *)
ELSE
NewSSPAddress = IA32_INTERRUPT_SSP_TABLE_ADDR + (IDT-gate IST « 3)
(* Check if shadow stacks are enabled at CPL 0 *)
IF ShadowStackEnabled(CPL 0)
THEN NewSSP ← 8 bytes loaded from NewSSPAddress; FI;
FI;
FI;
IF IDT gate is 32-bit
THEN
IF new stack does not have room for 24 bytes (error code pushed)
or 20 bytes (no error code pushed)
THEN #SS(error_code(NewSS,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
FI;
ELSE
IF IDT gate is 16-bit
THEN
IF new stack does not have room for 12 bytes (error code pushed)
or 10 bytes (no error code pushed);
THEN #SS(error_code(NewSS,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
ELSE (* 64-bit IDT gate*)
IF StackAddress is non-canonical
THEN #SS(EXT); FI; (* Error code contains NULL selector *)
FI;
FI;
IF (IA32_EFER.LMA = 0) (* Not IA-32e mode *)
THEN
IF instruction pointer from IDT gate is not within new code-segment limits
THEN #GP(EXT); FI; (* Error code contains NULL selector *)
ESP ← NewESP;
SS ← NewSS; (* Segment descriptor information also loaded *)
ELSE (* IA-32e mode *)
IF instruction pointer from IDT gate contains a non-canonical address
THEN #GP(EXT); FI; (* Error code contains NULL selector *)
RSP ← NewRSP & FFFFFFFFFFFFFFF0H;
SS ← NewSS;
FI;
IF IDT gate is 32-bit
THEN
CS:EIP ← Gate(CS:EIP); (* Segment descriptor information also loaded *)
ELSE
IF oldSS.DPL != 3
ShadowStackPush8B(oldCS); (* Padded with 48 high-order bits of 0 *)
ShadowStackPush8B(oldCSBASE + oldRIP); (* Padded with 32 high-order bits of 0 for 32 bit LIP*)
ShadowStackPush8B(oldSSP);
FI;
FI;
IF EndbranchEnabled (CPL)
IA32_S_CET.TRACKER = WAIT_FOR_ENDBRANCH;
IA32_S_CET.SUPPRESS = 0
FI;
IF IDT gate is interrupt gate
THEN IF ← 0 (* Interrupt flag set to 0, interrupts disabled *); FI;
TF ← 0;
VM ← 0;
RF ← 0;
NT ← 0;
END;
INTERRUPT-FROM-VIRTUAL-8086-MODE:
(* Identify stack-segment selector for privilege level 0 in current TSS *)
IF current TSS is 32-bit
THEN
IF TSS limit < 9
THEN #TS(error_code(current TSS selector,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
NewSS ← 2 bytes loaded from (current TSS base + 8);
NewESP ← 4 bytes loaded from (current TSS base + 4);
ELSE (* current TSS is 16-bit *)
IF TSS limit < 5
THEN #TS(error_code(current TSS selector,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
NewSS ← 2 bytes loaded from (current TSS base + 4);
NewESP ← 2 bytes loaded from (current TSS base + 2);
FI;
IF NewSS is NULL
THEN #TS(EXT); FI; (* Error code contains NULL selector *)
IF NewSS index is not within its descriptor table limits
or NewSS RPL ≠ 0
THEN #TS(error_code(NewSS,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
Read new stack-segment descriptor for NewSS in GDT or LDT;
IF new stack-segment DPL ≠ 0 or stack segment does not indicate writable data segment
THEN #TS(error_code(NewSS,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
IF new stack segment not present
THEN #SS(error_code(NewSS,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
NewSSP ← IA32_PLi_SSP (* where i = new code-segment DPL *)
IF IDT gate is 32-bit
THEN
IF new stack does not have room for 40 bytes (error code pushed)
or 36 bytes (no error code pushed)
THEN #SS(error_code(NewSS,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
ELSE (* IDT gate is 16-bit *)
IF new stack does not have room for 20 bytes (error code pushed)
or 18 bytes (no error code pushed)
THEN #SS(error_code(NewSS,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
FI;
IF instruction pointer from IDT gate is not within new code-segment limits
THEN #GP(EXT); FI; (* Error code contains NULL selector *)
tempEFLAGS ← EFLAGS;
VM ← 0;
TF ← 0;
RF ← 0;
NT ← 0;
IF service through interrupt gate
THEN IF ← 0; FI;
TempSS ← SS;
TempESP ← ESP;
SS ← NewSS;
ESP ← NewESP;
(* Following pushes are 16 bits for 16-bit IDT gates and 32 bits for 32-bit IDT gates;
Segment selector pushes in 32-bit mode are padded to two words *)
Push(GS);
Push(FS);
Push(DS);
Push(ES);
Push(TempSS);
Push(TempESP);
Push(TempEFlags);
Push(CS);
Push(EIP);
GS ← 0; (* Segment registers made NULL, invalid for use in protected mode *)
FS ← 0;
DS ← 0;
ES ← 0;
CS ← Gate(CS); (* Segment descriptor information also loaded *)
CS(RPL) ← 0;
CPL ← 0;
IF IDT gate is 32-bit
THEN
EIP ← Gate(instruction pointer);
ELSE (* IDT gate is 16-bit *)
EIP ← Gate(instruction pointer) AND 0000FFFFH;
FI;
IF ShadowStackEnabled(CPL)
oldSSP ← SSP
SSP ← NewSSP
IF SSP & 0x07 != 0
THEN #GP(0); FI;
IF ((EFER.LMA and CS.L) = 0 AND SSP[63:32] != 0)
THEN #GP(0); FI;
expected_token_value = SSP (* busy bit - bit position 0 - must be clear *)
new_token_value = SSP | BUSY_BIT (* Set the busy bit *)
IF shadow_stack_lock_cmpxchg8b(SSP, new_token_value, expected_token_value) != expected_token_value
THEN #GP(0); FI;
IF oldSS.DPL != 3
INTRA-PRIVILEGE-LEVEL-INTERRUPT:
NewSSP = SSP;
CHECK_SS_TOKEN = 0
(* PE = 1, DPL = CPL or conforming segment *)
IF IA32_EFER.LMA = 1 (* IA-32e mode *)
IF IDT-descriptor IST ≠ 0
THEN
TSSstackAddress ← (IDT-descriptor IST « 3) + 28;
IF (TSSstackAddress + 7) > TSS limit
THEN #TS(error_code(current TSS selector,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
NewRSP ← 8 bytes loaded from (current TSS base + TSSstackAddress);
ELSE NewRSP ← RSP;
FI;
IF ShadowStackEnabled(CPL)
THEN
NewSSPAddress = IA32_INTERRUPT_SSP_TABLE_ADDR + (IDT gate IST « 3)
NewSSP ← 8 bytes loaded from NewSSPAddress
CHECK_SS_TOKEN = 1
FI;
FI;
IF 32-bit gate (* implies IA32_EFER.LMA = 0 *)
THEN
IF current stack does not have room for 16 bytes (error code pushed)
or 12 bytes (no error code pushed)
THEN #SS(EXT); FI; (* Error code contains NULL selector *)
ELSE IF 16-bit gate (* implies IA32_EFER.LMA = 0 *)
IF current stack does not have room for 8 bytes (error code pushed)
or 6 bytes (no error code pushed)
THEN #SS(EXT); FI; (* Error code contains NULL selector *)
ELSE (* IA32_EFER.LMA = 1, 64-bit gate*)
IF NewRSP contains a non-canonical address
THEN #SS(EXT); (* Error code contains NULL selector *)
FI;
FI;
IF (IA32_EFER.LMA = 0) (* Not IA-32e mode *)
THEN
IF instruction pointer from IDT gate is not within new code-segment limit
THEN #GP(EXT); FI; (* Error code contains NULL selector *)
ELSE
IF instruction pointer from IDT gate contains a non-canonical address
THEN #GP(EXT); FI; (* Error code contains NULL selector *)
ELSE
IA32_S_CET.TRACKER = WAIT_FOR_ENDBRANCH
IA32_S_CET.SUPPRESS = 0
FI;
FI;
IF IDT gate is interrupt gate
THEN IF ← 0; FI; (* Interrupt flag set to 0; interrupts disabled *)
TF ← 0;
NT ← 0;
VM ← 0;
RF ← 0;
END;
Flags Affected
The EFLAGS register is pushed onto the stack. The IF, TF, NT, AC, RF, and VM flags may be cleared, depending on
the mode of operation of the processor when the INT instruction is executed (see the “Operation” section). If the
interrupt uses a task gate, any flags may be set or cleared, controlled by the EFLAGS image in the new task’s TSS.
Description
Invalidates mappings in the translation lookaside buffers (TLBs) and paging-structure caches based on process-
context identifier (PCID). (See Section 4.10, “Caching Translation Information,” in the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 3A.) Invalidation is based on the INVPCID type specified in the register
operand and the INVPCID descriptor specified in the memory operand.
Outside 64-bit mode, the register operand is always 32 bits, regardless of the value of CS.D. In 64-bit mode the
register operand has 64 bits.
There are four INVPCID types currently defined:
• Individual-address invalidation: If the INVPCID type is 0, the logical processor invalidates mappings—except
global translations—for the linear address and PCID specified in the INVPCID descriptor.1 In some cases, the
instruction may invalidate global translations or mappings for other linear addresses (or other PCIDs) as well.
• Single-context invalidation: If the INVPCID type is 1, the logical processor invalidates all mappings—except
global translations—associated with the PCID specified in the INVPCID descriptor. In some cases, the
instruction may invalidate global translations or mappings for other PCIDs as well.
• All-context invalidation, including global translations: If the INVPCID type is 2, the logical processor invalidates
all mappings—including global translations—associated with any PCID.
• All-context invalidation: If the INVPCID type is 3, the logical processor invalidates all mappings—except global
translations—associated with any PCID. In some cases, the instruction may invalidate global translations as
well.
The INVPCID descriptor comprises 128 bits and consists of a PCID and a linear address as shown in Figure 3-24.
For INVPCID type 0, the processor uses the full 64 bits of the linear address even outside 64-bit mode; the linear
address is not used for other INVPCID types.
Figure 3-24. INVPCID Descriptor: bits 127:64 = Linear Address; bits 63:12 = Reserved (must be zero); bits 11:0 = PCID.
1. If the paging structures map the linear address using a page larger than 4 KBytes and there are multiple TLB entries for that page
(see Section 4.10.2.3, “Details of TLB Use,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A), the
instruction invalidates all of them.
If CR4.PCIDE = 0, a logical processor does not cache information for any PCID other than 000H. In this case,
executions with INVPCID types 0 and 1 are allowed only if the PCID specified in the INVPCID descriptor is 000H;
executions with INVPCID types 2 and 3 invalidate mappings only for PCID 000H. Note that CR4.PCIDE must be 0
outside IA-32e mode (see Section 4.10.1, “Process-Context Identifiers (PCIDs)‚” of the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 3A).
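Packing the 128-bit descriptor in software is straightforward (hypothetical helper; the hardware consumes the 128-bit memory image directly):

```python
def make_invpcid_desc(pcid, linear_addr):
    # bits 11:0 = PCID, bits 63:12 reserved (must be zero),
    # bits 127:64 = linear address (used only by INVPCID type 0)
    assert 0 <= pcid < (1 << 12)
    assert 0 <= linear_addr < (1 << 64)
    return (linear_addr << 64) | pcid
```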
Operation
INVPCID_TYPE ← value of register operand; // must be in the range of 0–3
INVPCID_DESC ← value of memory operand;
CASE INVPCID_TYPE OF
0: // individual-address invalidation
PCID ← INVPCID_DESC[11:0];
L_ADDR ← INVPCID_DESC[127:64];
Invalidate mappings for L_ADDR associated with PCID except global translations;
BREAK;
1: // single PCID invalidation
PCID ← INVPCID_DESC[11:0];
Invalidate all mappings associated with PCID except global translations;
BREAK;
2: // all PCID invalidation including global translations
Invalidate all mappings for all PCIDs, including global translations;
BREAK;
3: // all PCID invalidation retaining global translations
Invalidate all mappings for all PCIDs except global translations;
BREAK;
ESAC;
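The four invalidation types can be modeled on a toy TLB, represented here as a set of (pcid, page, is_global) tuples. This data structure is purely illustrative; real TLBs are organized very differently:

```python
def invpcid(tlb, inv_type, pcid=0, linear_addr=0):
    page = linear_addr >> 12
    def survives(entry):
        p, pg, is_global = entry
        if inv_type == 0:   # individual address, except global translations
            return is_global or p != pcid or pg != page
        if inv_type == 1:   # single context, except global translations
            return is_global or p != pcid
        if inv_type == 2:   # all contexts, including global translations
            return False
        return is_global    # type 3: all contexts, except global translations
    return {e for e in tlb if survives(e)}
```

Note the model shows only the minimum invalidation each type guarantees; as described above, hardware is permitted to invalidate more.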
IRET/IRETD/IRETQ—Interrupt Return
Opcode       Instruction  Op/En  64-Bit Mode  Compat/Leg Mode  Description
CF           IRET         ZO     Valid        Valid            Interrupt return (16-bit operand size).
CF           IRETD        ZO     Valid        Valid            Interrupt return (32-bit operand size).
REX.W + CF   IRETQ        ZO     Valid        N.E.             Interrupt return (64-bit operand size).
Description
Returns program control from an exception or interrupt handler to a program or procedure that was interrupted by
an exception, an external interrupt, or a software-generated interrupt. These instructions are also used to perform
a return from a nested task. (A nested task is created when a CALL instruction is used to initiate a task switch or
when an interrupt or exception causes a task switch to an interrupt or exception handler.) See the section titled
“Task Linking” in Chapter 7 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
IRET and IRETD are mnemonics for the same opcode. The IRETD mnemonic (interrupt return double) is intended
for use when returning from an interrupt when using the 32-bit operand size; however, most assemblers use the
IRET mnemonic interchangeably for both operand sizes.
In Real-Address Mode, the IRET instruction performs a far return to the interrupted program or procedure. During
this operation, the processor pops the return instruction pointer, return code segment selector, and EFLAGS image
from the stack to the EIP, CS, and EFLAGS registers, respectively, and then resumes execution of the interrupted
program or procedure.
In Protected Mode, the action of the IRET instruction depends on the settings of the NT (nested task) and VM flags
in the EFLAGS register and the VM flag in the EFLAGS image stored on the current stack. Depending on the setting
of these flags, the processor performs the following types of interrupt returns:
• Return from virtual-8086 mode.
• Return to virtual-8086 mode.
• Intra-privilege level return.
• Inter-privilege level return.
• Return from nested task (task switch).
If the NT flag (EFLAGS register) is cleared, the IRET instruction performs a far return from the interrupt procedure,
without a task switch. The code segment being returned to must be equally or less privileged than the interrupt
handler routine (as indicated by the RPL field of the code segment selector popped from the stack).
As with a real-address mode interrupt return, the IRET instruction pops the return instruction pointer, return code
segment selector, and EFLAGS image from the stack to the EIP, CS, and EFLAGS registers, respectively, and then
resumes execution of the interrupted program or procedure. If the return is to another privilege level, the IRET
instruction also pops the stack pointer and SS from the stack, before resuming program execution. If the return is
to virtual-8086 mode, the processor also pops the data segment registers from the stack.
If the NT flag is set, the IRET instruction performs a task switch (return) from a nested task (a task called with a
CALL instruction, an interrupt, or an exception) back to the calling or interrupted task. The updated state of the
task executing the IRET instruction is saved in its TSS. If the task is re-entered later, the code that follows the IRET
instruction is executed.
If the NT flag is set and the processor is in IA-32e mode, the IRET instruction causes a general protection excep-
tion.
If nonmaskable interrupts (NMIs) are blocked (see Section 6.7.1, “Handling Multiple NMIs” in the Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 3A), execution of the IRET instruction unblocks NMIs.
This unblocking occurs even if the instruction causes a fault. In such a case, NMIs are unmasked before the excep-
tion handler is invoked.
In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.W prefix promotes operation to 64
bits (IRETQ). See the summary chart at the beginning of this section for encoding data and limits.
Refer to Chapter 6, “Procedure Calls, Interrupts, and Exceptions” and Chapter 18, “Control-Flow Enforcement
Technology (CET)” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 for CET
details.
Instruction ordering. IRET is a serializing instruction. See Section 8.3 of the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 3A.
See “Changes to Instruction Behavior in VMX Non-Root Operation” in Chapter 25 of the Intel® 64 and IA-32 Archi-
tectures Software Developer’s Manual, Volume 3C, for more information about the behavior of this instruction in
VMX non-root operation.
Operation
IF PE = 0
THEN GOTO REAL-ADDRESS-MODE;
ELSIF (IA32_EFER.LMA = 0)
THEN
IF (EFLAGS.VM = 1)
THEN GOTO RETURN-FROM-VIRTUAL-8086-MODE;
ELSE GOTO PROTECTED-MODE;
FI;
ELSE GOTO IA-32e-MODE;
FI;
REAL-ADDRESS-MODE:
IF OperandSize = 32
THEN
EIP ← Pop();
CS ← Pop(); (* 32-bit pop, high-order 16 bits discarded *)
tempEFLAGS ← Pop();
EFLAGS ← (tempEFLAGS AND 257FD5H) OR (EFLAGS AND 1A0000H);
ELSE (* OperandSize = 16 *)
EIP ← Pop(); (* 16-bit pop; clear upper 16 bits *)
CS ← Pop(); (* 16-bit pop *)
EFLAGS[15:0] ← Pop();
FI;
END;
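The 32-bit EFLAGS merge above takes the writable flag bits from the popped image and keeps VM (bit 17), VIF (bit 19), and VIP (bit 20) from the current register. As a sketch of the mask arithmetic (constant names are illustrative):

```python
POP_MASK  = 0x257FD5   # bits taken from the popped EFLAGS image
KEEP_MASK = 0x1A0000   # VM (bit 17), VIF (bit 19), VIP (bit 20) kept as-is

def iret_real_eflags(current, popped):
    # merge the popped EFLAGS image with the preserved bits of EFLAGS
    return (popped & POP_MASK) | (current & KEEP_MASK)
```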
RETURN-FROM-VIRTUAL-8086-MODE:
(* Processor is in virtual-8086 mode when IRET is executed and stays in virtual-8086 mode *)
IF IOPL = 3 (* Virtual mode: PE = 1, VM = 1, IOPL = 3 *)
THEN IF OperandSize = 32
THEN
EIP ← Pop();
CS ← Pop(); (* 32-bit pop, high-order 16 bits discarded *)
EFLAGS ← Pop();
(* VM, IOPL,VIP and VIF EFLAG bits not modified by pop *)
IF EIP not within CS limit
THEN #GP(0); FI;
ELSE (* OperandSize = 16 *)
EIP ← Pop(); (* 16-bit pop; clear upper 16 bits *)
CS ← Pop(); (* 16-bit pop *)
EFLAGS[15:0] ← Pop(); (* IOPL in EFLAGS is not modified *)
IF EIP not within CS limit
THEN #GP(0); FI;
FI;
ELSE (* IOPL < 3 *)
#GP(0); (* Trap to virtual-8086 monitor *)
FI;
END;
PROTECTED-MODE:
IF NT = 1
THEN GOTO TASK-RETURN; (* PE = 1, VM = 0, NT = 1 *)
FI;
IF OperandSize = 32
THEN
EIP ← Pop();
CS ← Pop(); (* 32-bit pop, high-order 16 bits discarded *)
tempEFLAGS ← Pop();
ELSE (* OperandSize = 16 *)
EIP ← Pop(); (* 16-bit pop; clear upper bits *)
CS ← Pop(); (* 16-bit pop *)
tempEFLAGS ← Pop(); (* 16-bit pop; clear upper bits *)
FI;
IF tempEFLAGS(VM) = 1 and CPL = 0
THEN GOTO RETURN-TO-VIRTUAL-8086-MODE;
ELSE GOTO PROTECTED-MODE-RETURN;
FI;
TASK-RETURN: (* PE = 1, VM = 0, NT = 1 *)
SWITCH-TASKS (without nesting) to TSS specified in link field of current TSS;
Mark the task just abandoned as NOT BUSY;
IF EIP is not within CS limit
THEN #GP(0); FI;
END;
RETURN-TO-VIRTUAL-8086-MODE:
(* Interrupted procedure was in virtual-8086 mode: PE = 1, CPL=0, VM = 1 in flag image *)
(* If shadow stack or indirect branch tracking at CPL3 then #GP(0) *)
IF CR4.CET AND (IA32_U_CET.ENDBR_EN OR IA32_U_CET.SHSTK_EN)
THEN #GP(0); FI;
shadowStackEnabled = ShadowStackEnabled(CPL)
IF EIP not within CS limit
THEN #GP(0); FI;
EFLAGS ← tempEFLAGS;
ESP ← Pop();
SS ← Pop(); (* Pop 2 words; throw away high-order word *)
ES ← Pop(); (* Pop 2 words; throw away high-order word *)
DS ← Pop(); (* Pop 2 words; throw away high-order word *)
FS ← Pop(); (* Pop 2 words; throw away high-order word *)
GS ← Pop(); (* Pop 2 words; throw away high-order word *)
IF shadowStackEnabled
(* check if 8 byte aligned *)
IF SSP AND 0x7 != 0
THEN #CP(FAR-RET/IRET); FI;
FI;
CPL ← 3;
(* Resume execution in Virtual-8086 mode *)
tempOldSSP = SSP;
(* Now past all faulting points; safe to free the token. The token free is done using the old SSP
* and using a supervisor override as old CPL was a supervisor privilege level *)
IF shadowStackEnabled
expected_token_value = tempOldSSP | BUSY_BIT (* busy bit - bit position 0 - must be set *)
new_token_value = tempOldSSP (* clear the busy bit *)
shadow_stack_lock_cmpxchg8b(tempOldSSP, new_token_value, expected_token_value)
FI;
END;
PROTECTED-MODE-RETURN: (* PE = 1 *)
IF CS(RPL) > CPL
THEN GOTO RETURN-TO-OUTER-PRIVILEGE-LEVEL;
ELSE GOTO RETURN-TO-SAME-PRIVILEGE-LEVEL; FI;
END;
RETURN-TO-OUTER-PRIVILEGE-LEVEL:
IF OperandSize = 32
THEN
ESP ← Pop();
SS ← Pop(); (* 32-bit pop, high-order 16 bits discarded *)
ELSE IF OperandSize = 16
THEN
ESP ← Pop(); (* 16-bit pop; clear upper bits *)
SS ← Pop(); (* 16-bit pop *)
ELSE (* OperandSize = 64 *)
RSP ← Pop();
SS ← Pop(); (* 64-bit pop, high-order 48 bits discarded *)
FI;
IF new mode ≠ 64-Bit Mode
THEN
IF EIP is not within CS limit
THEN #GP(0); FI;
ELSE (* new mode = 64-bit mode *)
IF RIP is non-canonical
THEN #GP(0); FI;
FI;
EFLAGS (CF, PF, AF, ZF, SF, TF, DF, OF, NT) ← tempEFLAGS;
IF OperandSize = 32 or OperandSize = 64
THEN EFLAGS(RF, AC, ID) ← tempEFLAGS; FI;
IF CPL ≤ IOPL
THEN EFLAGS(IF) ← tempEFLAGS; FI;
IF CPL = 0
THEN
EFLAGS(IOPL) ← tempEFLAGS;
IF OperandSize = 32 or OperandSize = 64
THEN EFLAGS(VIF, VIP) ← tempEFLAGS; FI;
FI;
IF ShadowStackEnabled(CPL)
(* check if 8 byte aligned *)
IF tempSSP AND 0x7 != 0
THEN #CP(FAR-RET/IRET); FI;
IF tempSSP != SSP
THEN (* free the token on the previous shadow stack *)
expected_token_value = SSP | BUSY_BIT (* busy bit - bit position 0 - must be set *)
new_token_value = SSP (* clear the busy bit *)
shadow_stack_lock_cmpxchg8b(SSP, new_token_value, expected_token_value)
FI;
SSP ← tempSSP;
FI;
FOR each of segment register (ES, FS, GS, and DS)
DO
IF segment register points to data or non-conforming code segment
and CPL > segment descriptor DPL (* DPL in hidden part of segment register *)
THEN SegmentSelector ← 0; (* Segment selector invalid *)
FI;
OD;
END;
IA-32e-MODE:
IF NT = 1
THEN #GP(0);
ELSE IF OperandSize = 32
THEN
EIP ← Pop();
CS ← Pop();
tempEFLAGS ← Pop();
ELSE IF OperandSize = 16
THEN
EIP ← Pop(); (* 16-bit pop; clear upper bits *)
CS ← Pop(); (* 16-bit pop *)
tempEFLAGS ← Pop(); (* 16-bit pop; clear upper bits *)
ELSE (* OperandSize = 64 *)
RIP ← Pop();
CS ← Pop(); (* 64-bit pop, high-order 48 bits discarded *)
tempRFLAGS ← Pop();
FI;
IF CS.RPL > CPL
THEN GOTO RETURN-TO-OUTER-PRIVILEGE-LEVEL;
ELSE
IF instruction began in 64-Bit Mode
THEN
IF OperandSize = 32
THEN
ESP ← Pop();
SS ← Pop(); (* 32-bit pop, high-order 16 bits discarded *)
ELSE IF OperandSize = 16
THEN
ESP ← Pop(); (* 16-bit pop; clear upper bits *)
SS ← Pop(); (* 16-bit pop *)
ELSE (* OperandSize = 64 *)
RSP ← Pop();
SS ← Pop(); (* 64-bit pop, high-order 48 bits discarded *)
FI;
FI;
GOTO RETURN-TO-SAME-PRIVILEGE-LEVEL; FI;
END;
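The real-address-mode flag update above masks the popped image with 257FD5H while preserving VM, VIF, and VIP from the old EFLAGS through the 1A0000H mask. A short Python sketch (the bit positions are architectural; the helper names are ours) makes the split explicit:

```python
# Architectural EFLAGS bit positions (IOPL occupies bits 12-13).
EFLAGS_BITS = {
    "CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7, "TF": 8, "IF": 9,
    "DF": 10, "OF": 11, "IOPL0": 12, "IOPL1": 13, "NT": 14, "RF": 16,
    "VM": 17, "AC": 18, "VIF": 19, "VIP": 20, "ID": 21,
}

def flags_in(mask: int) -> set:
    """Name the EFLAGS bits selected by a mask."""
    return {name for name, bit in EFLAGS_BITS.items() if (mask >> bit) & 1}

POPPED_MASK = 0x257FD5   # bits taken from the popped tempEFLAGS image
KEPT_MASK   = 0x1A0000   # bits preserved from the old EFLAGS value

# The masks are disjoint: VM, VIF, and VIP always come from the old EFLAGS.
assert POPPED_MASK & KEPT_MASK == 0
assert flags_in(KEPT_MASK) == {"VM", "VIF", "VIP"}
```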
Flags Affected
All the flags and fields in the EFLAGS register are potentially modified, depending on the mode of operation of the
processor. If performing a return from a nested task to a previous task, the EFLAGS register will be modified
according to the EFLAGS image stored in the previous task’s TSS.
JMP—Jump
Opcode Instruction Op/ 64-Bit Compat/ Description
En Mode Leg Mode
EB cb JMP rel8 D Valid Valid Jump short, RIP = RIP + 8-bit displacement sign
extended to 64-bits
E9 cw JMP rel16 D N.S. Valid Jump near, relative, displacement relative to
next instruction. Not supported in 64-bit
mode.
E9 cd JMP rel32 D Valid Valid Jump near, relative, RIP = RIP + 32-bit
displacement sign extended to 64-bits
FF /4 JMP r/m16 M N.S. Valid Jump near, absolute indirect, address = zero-
extended r/m16. Not supported in 64-bit
mode.
FF /4 JMP r/m32 M N.S. Valid Jump near, absolute indirect, address given in
r/m32. Not supported in 64-bit mode.
FF /4 JMP r/m64 M Valid N.E. Jump near, absolute indirect, RIP = 64-Bit
offset from register or memory
EA cd JMP ptr16:16 D Inv. Valid Jump far, absolute, address given in operand
EA cp JMP ptr16:32 D Inv. Valid Jump far, absolute, address given in operand
FF /5 JMP m16:16 D Valid Valid Jump far, absolute indirect, address given in
m16:16
FF /5 JMP m16:32 D Valid Valid Jump far, absolute indirect, address given in
m16:32.
REX.W FF /5 JMP m16:64 D Valid N.E. Jump far, absolute indirect, address given in
m16:64.
Description
Transfers program control to a different point in the instruction stream without recording return information. The
destination (target) operand specifies the address of the instruction being jumped to. This operand can be an
immediate value, a general-purpose register, or a memory location.
This instruction can be used to execute four different types of jumps:
• Near jump—A jump to an instruction within the current code segment (the segment currently pointed to by the
CS register), sometimes referred to as an intrasegment jump.
• Short jump—A near jump where the jump range is limited to –128 to +127 from the current EIP value.
• Far jump—A jump to an instruction located in a different segment than the current code segment but at the
same privilege level, sometimes referred to as an intersegment jump.
• Task switch—A jump to an instruction located in a different task.
A task switch can only be executed in protected mode (see Chapter 7, in the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 3A, for information on performing task switches with the JMP instruction).
Near and Short Jumps. When executing a near jump, the processor jumps to the address (within the current code
segment) that is specified with the target operand. The target operand specifies either an absolute offset (that is
an offset from the base of the code segment) or a relative offset (a signed displacement relative to the current
value of the instruction pointer in the EIP register). A near jump to a relative offset of 8-bits (rel8) is referred to as
a short jump. The CS register is not changed on near and short jumps.
An absolute offset is specified indirectly in a general-purpose register or a memory location (r/m16 or r/m32). The
operand-size attribute determines the size of the target operand (16 or 32 bits). Absolute offsets are loaded
directly into the EIP register. If the operand-size attribute is 16, the upper two bytes of the EIP register are cleared,
resulting in a maximum instruction pointer size of 16 bits.
A relative offset (rel8, rel16, or rel32) is generally specified as a label in assembly code, but at the machine code
level, it is encoded as a signed 8-, 16-, or 32-bit immediate value. This value is added to the value in the EIP
register. (Here, the EIP register contains the address of the instruction following the JMP instruction). When using
relative offsets, the opcode (for short vs. near jumps) and the operand-size attribute (for near relative jumps)
determine the size of the target operand (8, 16, or 32 bits).
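The displacement arithmetic described above can be sketched in Python (a model assuming a 32-bit address space; the function name is ours):

```python
def near_relative_target(next_eip: int, rel: int, width: int) -> int:
    """Target of a near relative JMP: next_eip is the address of the
    instruction following the JMP; rel is the raw unsigned 8-, 16-, or
    32-bit immediate, which the processor sign-extends before adding."""
    sign = 1 << (width - 1)
    disp = (rel ^ sign) - sign          # sign-extend the immediate
    return (next_eip + disp) & 0xFFFFFFFF

# rel8 = 0xFE encodes a displacement of -2 (a short backward jump)
assert near_relative_target(0x1000, 0xFE, 8) == 0x0FFE
# rel32 jumps reach forward as well
assert near_relative_target(0x401005, 0x100, 32) == 0x401105
```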
Far Jumps in Real-Address or Virtual-8086 Mode. When executing a far jump in real-address or virtual-8086 mode,
the processor jumps to the code segment and offset specified with the target operand. Here the target operand
specifies an absolute far address either directly with a pointer (ptr16:16 or ptr16:32) or indirectly with a memory
location (m16:16 or m16:32). With the pointer method, the segment and address of the jump target are encoded
in the instruction, using a 4-byte (16-bit operand size) or 6-byte (32-bit operand size) far address immediate.
With the indirect method, the target operand specifies a memory location that contains a 4-byte (16-bit
operand size) or 6-byte (32-bit operand size) far address. The far address is loaded directly into the CS and EIP
registers. If the operand-size attribute is 16, the upper two bytes of the EIP register are cleared.
Far Jumps in Protected Mode. When the processor is operating in protected mode, the JMP instruction can be used
to perform the following three types of far jumps:
• A far jump to a conforming or non-conforming code segment.
• A far jump through a call gate.
• A task switch.
(The JMP instruction cannot be used to perform inter-privilege-level far jumps.)
In protected mode, the processor always uses the segment selector part of the far address to access the corre-
sponding descriptor in the GDT or LDT. The descriptor type (code segment, call gate, task gate, or TSS) and access
rights determine the type of jump to be performed.
If the selected descriptor is for a code segment, a far jump to a code segment at the same privilege level is
performed. (If the selected code segment is at a different privilege level and the code segment is non-conforming,
a general-protection exception is generated.) A far jump to the same privilege level in protected mode is very
similar to one carried out in real-address or virtual-8086 mode. The target operand specifies an absolute far
address either directly with a pointer (ptr16:16 or ptr16:32) or indirectly with a memory location (m16:16 or
m16:32). The operand-size attribute determines the size of the offset (16 or 32 bits) in the far address. The new
code segment selector and its descriptor are loaded into CS register, and the offset from the instruction is loaded
into the EIP register. Note that a call gate (described in the next paragraph) can also be used to perform a far jump to
a code segment at the same privilege level. Using this mechanism provides an extra level of indirection and is the
preferred method of making jumps between 16-bit and 32-bit code segments.
When executing a far jump through a call gate, the segment selector specified by the target operand identifies the
call gate. (The offset part of the target operand is ignored.) The processor then jumps to the code segment speci-
fied in the call gate descriptor and begins executing the instruction at the offset specified in the call gate. No stack
switch occurs. Here again, the target operand can specify the far address of the call gate either directly with a
pointer (ptr16:16 or ptr16:32) or indirectly with a memory location (m16:16 or m16:32).
Executing a task switch with the JMP instruction is somewhat similar to executing a jump through a call gate. Here
the target operand specifies the segment selector of the task gate for the task being switched to (and the offset
part of the target operand is ignored). The task gate in turn points to the TSS for the task, which contains the
segment selectors for the task’s code and stack segments. The TSS also contains the EIP value for the next instruc-
tion that was to be executed before the task was suspended. This instruction pointer value is loaded into the EIP
register so that the task begins executing again at this next instruction.
The JMP instruction can also specify the segment selector of the TSS directly, which eliminates the indirection of the
task gate. See Chapter 7 in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for
detailed information on the mechanics of a task switch.
Note that when you execute a task switch with a JMP instruction, the nested task flag (NT) is not set in the EFLAGS
register and the new TSS’s previous task link field is not loaded with the old task’s TSS selector. A return to the
previous task thus cannot be carried out by executing the IRET instruction. Switching tasks with the JMP
instruction differs in this regard from the CALL instruction, which does set the NT flag and save the previous task
link information, allowing a return to the calling task with an IRET instruction.
Refer to Chapter 6, “Procedure Calls, Interrupts, and Exceptions” and Chapter 18, “Control-Flow Enforcement Tech-
nology (CET)” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 for CET details.
In 64-Bit Mode. The instruction’s operation size is fixed at 64 bits. If a selector points to a gate, then RIP equals the
64-bit displacement taken from the gate; else RIP equals the zero-extended offset from the far pointer referenced in
the instruction.
See the summary chart at the beginning of this section for encoding data and limits.
Instruction ordering. Instructions following a far jump may be fetched from memory before earlier instructions
complete execution, but they will not execute (even speculatively) until all instructions prior to the far jump have
completed execution (the later instructions may execute before data stored by the earlier instructions have become
globally visible).
Certain situations may lead to the next sequential instruction after a near indirect JMP being speculatively
executed. If software needs to prevent this (e.g., in order to prevent a speculative execution side channel), then an
INT3 or LFENCE instruction opcode can be placed after the near indirect JMP in order to block speculative execution.
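At the byte level, this mitigation amounts to placing an LFENCE (0F AE E8) or INT3 (CC) opcode immediately after the indirect jump, so that a speculative fall-through hits the barrier. A minimal Python sketch (the encoder covers only JMP RAX; the helper names are ours):

```python
JMP_RAX = bytes([0xFF, 0xE0])        # FF /4 with ModRM 11-100-000: JMP RAX
LFENCE  = bytes([0x0F, 0xAE, 0xE8])  # blocks speculative execution past the jump
INT3    = bytes([0xCC])              # alternative barrier: breakpoint opcode

def hardened_indirect_jmp(barrier: bytes = LFENCE) -> bytes:
    """Emit an indirect jump followed by a speculation barrier.

    The barrier bytes are never reached architecturally (the JMP always
    transfers control), so they only constrain speculative execution of
    the next sequential instruction."""
    return JMP_RAX + barrier

assert hardened_indirect_jmp() == b"\xff\xe0\x0f\xae\xe8"
assert hardened_indirect_jmp(INT3) == b"\xff\xe0\xcc"
```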
Operation
IF near jump
IF 64-bit Mode
THEN
IF near relative jump
THEN
tempRIP ← RIP + DEST; (* RIP is instruction following JMP instruction*)
ELSE (* Near absolute jump *)
tempRIP ← DEST;
FI;
ELSE
IF near relative jump
THEN
tempEIP ← EIP + DEST; (* EIP is instruction following JMP instruction*)
ELSE (* Near absolute jump *)
tempEIP ← DEST;
FI;
FI;
IF (IA32_EFER.LMA = 0 or target mode = Compatibility mode)
and tempEIP outside code segment limit
THEN #GP(0); FI
IF 64-bit mode and tempRIP is not canonical
THEN #GP(0);
FI;
IF OperandSize = 32
THEN
EIP ← tempEIP;
ELSE
IF OperandSize = 16
THEN (* OperandSize = 16 *)
EIP ← tempEIP AND 0000FFFFH;
ELSE (* OperandSize = 64 *)
RIP ← tempRIP;
FI;
FI;
IF (JMP near indirect, absolute indirect)
IF EndbranchEnabledAndNotSuppressed(CPL)
IF CPL = 3
THEN
IF ( no 3EH prefix OR IA32_U_CET.NO_TRACK_EN == 0 )
THEN
IA32_U_CET.TRACKER = WAIT_FOR_ENDBRANCH
FI;
ELSE
IF ( no 3EH prefix OR IA32_S_CET.NO_TRACK_EN == 0 )
THEN
IA32_S_CET.TRACKER = WAIT_FOR_ENDBRANCH
FI;
FI;
FI;
FI;
FI;
IF far jump and (PE = 0 or (PE = 1 AND VM = 1)) (* Real-address or virtual-8086 mode *)
THEN
tempEIP ← DEST(Offset); (* DEST is ptr16:32 or [m16:32] *)
IF tempEIP is beyond code segment limit
THEN #GP(0); FI;
CS ← DEST(segment selector); (* DEST is ptr16:32 or [m16:32] *)
IF OperandSize = 32
THEN
EIP ← tempEIP; (* DEST is ptr16:32 or [m16:32] *)
ELSE (* OperandSize = 16 *)
EIP ← tempEIP AND 0000FFFFH; (* Clear upper 16 bits *)
FI;
FI;
IF far jump and (PE = 1 and VM = 0)
(* IA-32e mode or protected mode, not virtual-8086 mode *)
THEN
IF effective address in the CS, DS, ES, FS, GS, or SS segment is illegal
or segment selector in target operand NULL
THEN #GP(0); FI;
IF segment selector index not within descriptor table limits
THEN #GP(new selector); FI;
Read type and access rights of segment descriptor;
IF (EFER.LMA = 0)
THEN
IF segment type is not a conforming or nonconforming code
segment, call gate, task gate, or TSS
THEN #GP(segment selector); FI;
ELSE
IF segment type is not a conforming or nonconforming code segment
call gate
THEN #GP(segment selector); FI;
FI;
Depending on type and access rights:
GO TO CONFORMING-CODE-SEGMENT;
GO TO NONCONFORMING-CODE-SEGMENT;
GO TO CALL-GATE;
GO TO TASK-GATE;
GO TO TASK-STATE-SEGMENT;
ELSE
#GP(segment selector);
FI;
CONFORMING-CODE-SEGMENT:
IF L-Bit = 1 and D-BIT = 1 and IA32_EFER.LMA = 1
THEN #GP(new code segment selector); FI;
IF DPL > CPL
THEN #GP(segment selector); FI;
IF segment not present
THEN #NP(segment selector); FI;
tempEIP ← DEST(Offset);
IF OperandSize = 16
THEN tempEIP ← tempEIP AND 0000FFFFH;
FI;
IF (IA32_EFER.LMA = 0 or target mode = Compatibility mode) and
tempEIP outside code segment limit
THEN #GP(0); FI
IF tempEIP is non-canonical
THEN #GP(0); FI;
IF ShadowStackEnabled(CPL)
IF (EFER.LMA and DEST(segment selector).L) = 0
(* If target is legacy or compatibility mode then the SSP must be in low 4GB *)
IF (SSP & 0xFFFFFFFF00000000 != 0)
THEN #GP(0); FI;
FI;
FI;
CS ← DEST[segment selector]; (* Segment descriptor information also loaded *)
CS(RPL) ← CPL
EIP ← tempEIP;
IF EndbranchEnabled(CPL)
IF CPL = 3
THEN
IA32_U_CET.TRACKER = WAIT_FOR_ENDBRANCH
IA32_U_CET.SUPPRESS = 0
ELSE
IA32_S_CET.TRACKER = WAIT_FOR_ENDBRANCH
IA32_S_CET.SUPPRESS = 0
FI;
FI;
END;
NONCONFORMING-CODE-SEGMENT:
IF L-Bit = 1 and D-BIT = 1 and IA32_EFER.LMA = 1
THEN #GP(new code segment selector); FI;
IF (RPL > CPL) OR (DPL ≠ CPL)
THEN #GP(code segment selector); FI;
IF segment not present
THEN #NP(segment selector); FI;
tempEIP ← DEST(Offset);
IF OperandSize = 16
THEN tempEIP ← tempEIP AND 0000FFFFH; FI;
IF (IA32_EFER.LMA = 0 OR target mode = Compatibility mode)
and tempEIP outside code segment limit
THEN #GP(0); FI
IF tempEIP is non-canonical THEN #GP(0); FI;
IF ShadowStackEnabled(CPL)
IF (EFER.LMA and DEST(segment selector).L) = 0
(* If target is legacy or compatibility mode then the SSP must be in low 4GB *)
IF (SSP & 0xFFFFFFFF00000000 != 0)
THEN #GP(0); FI;
FI;
FI;
CS ← DEST[segment selector]; (* Segment descriptor information also loaded *)
CS(RPL) ← CPL;
EIP ← tempEIP;
IF EndbranchEnabled(CPL)
IF CPL = 3
THEN
IA32_U_CET.TRACKER = WAIT_FOR_ENDBRANCH
IA32_U_CET.SUPPRESS = 0
ELSE
IA32_S_CET.TRACKER = WAIT_FOR_ENDBRANCH
IA32_S_CET.SUPPRESS = 0
FI;
FI;
END;
CALL-GATE:
IF call gate DPL < CPL
or call gate DPL < call gate segment-selector RPL
THEN #GP(call gate selector); FI;
IF call gate not present
THEN #NP(call gate selector); FI;
IF call gate code-segment selector is NULL
THEN #GP(0); FI;
IF call gate code-segment selector index outside descriptor table limits
THEN #GP(code segment selector); FI;
Read code segment descriptor;
IF code-segment segment descriptor does not indicate a code segment
or code-segment segment descriptor is conforming and DPL > CPL
or code-segment segment descriptor is non-conforming and DPL ≠ CPL
THEN #GP(code segment selector); FI;
IF IA32_EFER.LMA = 1 and (code-segment descriptor is not a 64-bit code segment
or code-segment segment descriptor has both L-Bit and D-bit set)
THEN #GP(code segment selector); FI;
IF code segment is not present
THEN #NP(code-segment selector); FI;
tempEIP ← DEST(Offset);
IF GateSize = 16
THEN tempEIP ← tempEIP AND 0000FFFFH; FI;
IF (IA32_EFER.LMA = 0 OR target mode = Compatibility mode) AND tempEIP
outside code segment limit
THEN #GP(0); FI
CS ← DEST[SegmentSelector]; (* Segment descriptor information also loaded *)
CS(RPL) ← CPL;
EIP ← tempEIP;
IF EndbranchEnabled(CPL)
IF CPL = 3
THEN
IA32_U_CET.TRACKER = WAIT_FOR_ENDBRANCH;
IA32_U_CET.SUPPRESS = 0
ELSE
IA32_S_CET.TRACKER = WAIT_FOR_ENDBRANCH;
IA32_S_CET.SUPPRESS = 0
FI;
FI;
END;
TASK-GATE:
IF task gate DPL < CPL
or task gate DPL < task gate segment-selector RPL
THEN #GP(task gate selector); FI;
IF task gate not present
THEN #NP(gate selector); FI;
Read the TSS segment selector in the task-gate descriptor;
IF TSS segment selector local/global bit is set to local
or index not within GDT limits
or descriptor is not a TSS segment
or TSS descriptor specifies that the TSS is busy
THEN #GP(TSS selector); FI;
IF TSS not present
THEN #NP(TSS selector); FI;
SWITCH-TASKS to TSS;
IF EIP not within code segment limit
THEN #GP(0); FI;
END;
TASK-STATE-SEGMENT:
IF TSS DPL < CPL
or TSS DPL < TSS segment-selector RPL
or TSS descriptor indicates TSS not available
THEN #GP(TSS selector); FI;
IF TSS is not present
THEN #NP(TSS selector); FI;
SWITCH-TASKS to TSS;
IF EIP not within code segment limit
THEN #GP(0); FI;
END;
Flags Affected
All flags are affected if a task switch occurs; no flags are affected if a task switch does not occur.
#GP(selector) If the segment descriptor pointed to by the segment selector in the destination operand is not
for a conforming-code segment, nonconforming-code segment, call gate, task gate, or task
state segment.
If the DPL for a nonconforming-code segment is not equal to the CPL.
(When not using a call gate.) If the RPL for the segment’s segment selector is greater than the
CPL.
If the DPL for a conforming-code segment is greater than the CPL.
If the DPL from a call-gate, task-gate, or TSS segment descriptor is less than the CPL or than
the RPL of the call-gate, task-gate, or TSS’s segment selector.
If the segment descriptor for selector in a call gate does not indicate it is a code segment.
If the segment descriptor for the segment selector in a task gate does not indicate an available
TSS.
If the segment selector for a TSS has its local/global bit set for local.
If a TSS segment descriptor specifies that the TSS is busy or not available.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#NP(selector) If the code segment being accessed is not present.
If call gate, task gate, or TSS not present.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3. (Only occurs when fetching target from memory.)
#UD If the LOCK prefix is used.
------------------------------------------------------------------------------------------
Changes to this chapter:
Updates to the following instructions: MOV, MOVDIR64B, PALIGNR, PCLMULQDQ, PHADDSW, PHADDW/PHADDD,
PHSUBSW, PHSUBW/PHSUBD, POP, and SHA256RNDS2.
Added Control-flow Enforcement Technology details to the following instructions: RET, SYSCALL, SYSENTER,
SYSEXIT, and SYSRET.
MOV—Move
Opcode Instruction Op/ 64-Bit Compat/ Description
En Mode Leg Mode
88 /r MOV r/m8,r8 MR Valid Valid Move r8 to r/m8.
REX + 88 /r MOV r/m8***,r8*** MR Valid N.E. Move r8 to r/m8.
89 /r MOV r/m16,r16 MR Valid Valid Move r16 to r/m16.
89 /r MOV r/m32,r32 MR Valid Valid Move r32 to r/m32.
REX.W + 89 /r MOV r/m64,r64 MR Valid N.E. Move r64 to r/m64.
8A /r MOV r8,r/m8 RM Valid Valid Move r/m8 to r8.
REX + 8A /r MOV r8***,r/m8*** RM Valid N.E. Move r/m8 to r8.
8B /r MOV r16,r/m16 RM Valid Valid Move r/m16 to r16.
8B /r MOV r32,r/m32 RM Valid Valid Move r/m32 to r32.
REX.W + 8B /r MOV r64,r/m64 RM Valid N.E. Move r/m64 to r64.
8C /r MOV r/m16,Sreg** MR Valid Valid Move segment register to r/m16.
8C /r MOV r16/r32/m16, Sreg** MR Valid Valid Move zero extended 16-bit segment register
to r16/r32/m16.
REX.W + 8C /r MOV r64/m16, Sreg** MR Valid Valid Move zero extended 16-bit segment register
to r64/m16.
8E /r MOV Sreg,r/m16** RM Valid Valid Move r/m16 to segment register.
REX.W + 8E /r MOV Sreg,r/m64** RM Valid Valid Move lower 16 bits of r/m64 to segment
register.
A0 MOV AL,moffs8* FD Valid Valid Move byte at (seg:offset) to AL.
REX.W + A0 MOV AL,moffs8* FD Valid N.E. Move byte at (offset) to AL.
A1 MOV AX,moffs16* FD Valid Valid Move word at (seg:offset) to AX.
A1 MOV EAX,moffs32* FD Valid Valid Move doubleword at (seg:offset) to EAX.
REX.W + A1 MOV RAX,moffs64* FD Valid N.E. Move quadword at (offset) to RAX.
A2 MOV moffs8,AL TD Valid Valid Move AL to (seg:offset).
REX.W + A2 MOV moffs8***,AL TD Valid N.E. Move AL to (offset).
A3 MOV moffs16*,AX TD Valid Valid Move AX to (seg:offset).
A3 MOV moffs32*,EAX TD Valid Valid Move EAX to (seg:offset).
REX.W + A3 MOV moffs64*,RAX TD Valid N.E. Move RAX to (offset).
B0+ rb ib MOV r8, imm8 OI Valid Valid Move imm8 to r8.
REX + B0+ rb ib MOV r8***, imm8 OI Valid N.E. Move imm8 to r8.
B8+ rw iw MOV r16, imm16 OI Valid Valid Move imm16 to r16.
B8+ rd id MOV r32, imm32 OI Valid Valid Move imm32 to r32.
REX.W + B8+ rd io MOV r64, imm64 OI Valid N.E. Move imm64 to r64.
C6 /0 ib MOV r/m8, imm8 MI Valid Valid Move imm8 to r/m8.
REX + C6 /0 ib MOV r/m8***, imm8 MI Valid N.E. Move imm8 to r/m8.
C7 /0 iw MOV r/m16, imm16 MI Valid Valid Move imm16 to r/m16.
C7 /0 id MOV r/m32, imm32 MI Valid Valid Move imm32 to r/m32.
REX.W + C7 /0 id MOV r/m64, imm32 MI Valid N.E. Move imm32 sign extended to 64-bits to
r/m64.
NOTES:
* The moffs8, moffs16, moffs32 and moffs64 operands specify a simple offset relative to the segment base, where 8, 16, 32 and 64
refer to the size of the data. The address-size attribute of the instruction determines the size of the offset, either 16, 32 or 64
bits.
** In 32-bit mode, the assembler may insert the 16-bit operand-size prefix with this instruction (see the following “Description” sec-
tion for further information).
***In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.
Description
Copies the second operand (source operand) to the first operand (destination operand). The source operand can be
an immediate value, general-purpose register, segment register, or memory location; the destination register can
be a general-purpose register, segment register, or memory location. Both operands must be the same size, which
can be a byte, a word, a doubleword, or a quadword.
The MOV instruction cannot be used to load the CS register. Attempting to do so results in an invalid opcode excep-
tion (#UD). To load the CS register, use the far JMP, CALL, or RET instruction.
If the destination operand is a segment register (DS, ES, FS, GS, or SS), the source operand must be a valid
segment selector. In protected mode, moving a segment selector into a segment register automatically causes the
segment descriptor information associated with that segment selector to be loaded into the hidden (shadow) part
of the segment register. While loading this information, the segment selector and segment descriptor information
is validated (see the “Operation” algorithm below). The segment descriptor data is obtained from the GDT or LDT
entry for the specified segment selector.
A NULL segment selector (values 0000-0003) can be loaded into the DS, ES, FS, and GS registers without causing
a protection exception. However, any subsequent attempt to reference a segment whose corresponding segment
register is loaded with a NULL value causes a general protection exception (#GP) and no memory reference occurs.
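The NULL-selector rule follows directly from the selector layout: RPL in bits 1:0, the table indicator in bit 2, and the descriptor index in bits 15:3. A Python sketch (function names are ours):

```python
def decode_selector(sel: int) -> dict:
    """Split a 16-bit segment selector into its architectural fields."""
    return {
        "index": sel >> 3,     # descriptor table index
        "ti": (sel >> 2) & 1,  # table indicator: 0 = GDT, 1 = LDT
        "rpl": sel & 3,        # requested privilege level
    }

def is_null_selector(sel: int) -> bool:
    """NULL selectors are values 0-3: GDT index 0 with any RPL."""
    return (sel >> 2) == 0

assert is_null_selector(0x0003) and not is_null_selector(0x0008)
# 0x2B: index 5 in the GDT with RPL 3
assert decode_selector(0x2B) == {"index": 5, "ti": 0, "rpl": 3}
```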
Loading the SS register with a MOV instruction suppresses or inhibits some debug exceptions and inhibits inter-
rupts on the following instruction boundary. (The inhibition ends after delivery of an exception or the execution of
the next instruction.) This behavior allows a stack pointer to be loaded into the ESP register with the next instruc-
tion (MOV ESP, stack-pointer value) before an event can be delivered. See Section 6.8.3, “Masking Exceptions
and Interrupts When Switching Stacks,” in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume
3A. Intel recommends that software use the LSS instruction to load the SS register and ESP together.
When executing MOV Reg, Sreg, the processor copies the content of Sreg to the 16 least significant bits of the
general-purpose register. The upper bits of the destination register are zero for most IA-32 processors (Pentium
Pro processors and later) and all Intel 64 processors, with the exception that bits 31:16 are undefined for Intel
Quark X1000 processors, Pentium and earlier processors.
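On the zeroing processors described above, the destination of MOV Reg, Sreg can be modeled as a plain zero-extension of the 16-bit selector (a sketch; the helper name is ours):

```python
def mov_reg_sreg(sreg: int, dest_bits: int = 32) -> int:
    """Model MOV r16/r32/r64, Sreg on processors that zero the upper bits:
    the 16-bit selector lands in the low bits; everything above is zero."""
    assert dest_bits in (16, 32, 64)
    return sreg & 0xFFFF  # already zero-extended to any destination width

assert mov_reg_sreg(0x002B, 64) == 0x2B
```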
In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.R prefix permits access to addi-
tional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the
beginning of this section for encoding data and limits.
Operation
DEST ← SRC;
Loading a segment register while in protected mode results in special checks and actions, as described in the
following listing. These checks are performed on the segment selector and the segment descriptor to which it
points.
IF SS is loaded
THEN
IF segment selector is NULL
THEN #GP(0); FI;
IF segment selector index is outside descriptor table limits
OR segment selector's RPL ≠ CPL
OR segment is not a writable data segment
OR DPL ≠ CPL
THEN #GP(selector); FI;
IF segment not marked present
THEN #SS(selector);
ELSE
SS ← segment selector;
SS ← segment descriptor; FI;
FI;
Flags Affected
None
Description
Moves 64 bytes as a direct store with 64-byte write atomicity from the source memory address to the destination memory
address. The source operand is a normal memory operand. The destination operand is a memory location specified
in a general-purpose register. The register content is interpreted as an offset into ES segment without any segment
override. In 64-bit mode, the register operand width is 64-bits (32-bits with 67H prefix). Outside of 64-bit mode,
the register width is 32-bits when CS.D=1 (16-bits with 67H prefix), and 16-bits when CS.D=0 (32-bits with 67H
prefix). MOVDIR64B requires the destination address to be 64-byte aligned. No alignment restriction is enforced
for source operand.
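The destination-address rules above (truncation to the effective address width, 64-byte alignment) can be sketched in Python (a model; the function name and the #GP string are ours):

```python
def movdir64b_dest_offset(reg_value: int, addr_width: int) -> int:
    """Model MOVDIR64B destination formation: the register content is
    truncated to the effective address width and interpreted as an
    offset into ES; the destination must be 64-byte aligned."""
    offset = reg_value & ((1 << addr_width) - 1)
    if offset % 64 != 0:
        raise ValueError("#GP: destination not 64-byte aligned")
    return offset

# In 64-bit mode with a 67H prefix the register is truncated to 32 bits.
assert movdir64b_dest_offset(0x1_0000_0040, 32) == 0x40
```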
MOVDIR64B first reads 64-bytes from the source memory address. It then performs a 64-byte direct-store opera-
tion to the destination address. The load operation follows normal read ordering based on source address memory-
type. The direct-store is implemented by using the write combining (WC) memory type protocol for writing data.
Using this protocol, the processor does not write the data into the cache hierarchy, nor does it fetch the corre-
sponding cache line from memory into the cache hierarchy. If the destination address is cached, the line is written-
back (if modified) and invalidated from the cache, before the direct-store.
Unlike stores with non-temporal hint which allow UC/WP memory-type for destination to override the non-temporal
hint, direct-stores always follow WC memory type protocol irrespective of destination address memory type
(including UC/WP types). Unlike WC stores and stores with non-temporal hint, direct-stores are eligible for imme-
diate eviction from the write-combining buffer, and thus not combined with younger stores (including direct-stores)
to the same address. Older WC and non-temporal stores held in the write-combining buffer may be combined with
younger direct stores to the same address. Because the WC protocol used by direct stores follows a weakly-ordered
memory consistency model, a fencing operation using SFENCE or MFENCE should follow the MOVDIR64B instruction
to enforce ordering when needed.
There is no atomicity guarantee provided for the 64-byte load operation from source address, and processor imple-
mentations may use multiple load operations to read the 64-bytes. The 64-byte direct-store issued by MOVDIR64B
guarantees 64-byte write-completion atomicity. This means that the data arrives at the destination in a single undi-
vided 64-byte write transaction.
Availability of the MOVDIR64B instruction is indicated by the presence of the CPUID feature flag MOVDIR64B (bit
28 of the ECX register in leaf 07H, see “CPUID—CPU Identification” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A).
Operation
DEST ← SRC;
1. The Mod field of the ModR/M byte cannot have value 11B.
Description
(V)PALIGNR concatenates the destination operand (the first operand) and the source operand (the second
operand) into an intermediate composite, shifts the composite at byte granularity to the right by a constant
immediate, and extracts the right-aligned result into the destination. The first and second operands can be MMX,
XMM, or YMM registers. The immediate value is considered unsigned; immediate shift counts larger than 2L
(i.e., 32 for 128-bit operands, or 16 for 64-bit operands) produce a zero result. When the source operand is a
128-bit memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception
(#GP) will be generated.
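The 128-bit form can be modeled with byte arrays stored in little-endian order (index 0 is the least-significant byte); the function name is ours:

```python
def palignr128(dest: bytes, src: bytes, imm8: int) -> bytes:
    """128-bit PALIGNR: concatenate dest (high half) with src (low half),
    shift the 32-byte composite right by imm8 bytes, keep the low 16
    bytes. Shift counts of 32 or more therefore produce all zeros."""
    assert len(dest) == len(src) == 16
    composite = src + dest              # little-endian: src supplies the low half
    window = composite[imm8:imm8 + 16]
    return window + bytes(16 - len(window))

lo, hi = bytes(range(16)), bytes(range(16, 32))
assert palignr128(hi, lo, 0) == lo          # imm8 = 0 selects the source
assert palignr128(hi, lo, 16) == hi         # imm8 = 16 selects the destination
assert palignr128(hi, lo, 4) == bytes(range(4, 20))
assert palignr128(hi, lo, 32) == bytes(16)  # out-of-range count zeroes the result
```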
In 64-bit mode and not encoded by VEX/EVEX prefix, use the REX prefix to access additional registers.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding YMM destination register remain unchanged.
EVEX.512 encoded version: The first source operand is a ZMM register and contains four 16-byte blocks. The
second source operand is a ZMM register or a 512-bit memory location containing four 16-byte blocks. The destina-
tion operand is a ZMM register and contains four 16-byte results. The imm8[7:0] is the common shift count
used for each of the four successive 16-byte block sources. The low 16-byte blocks of the two source operands
produce the low 16-byte result of the destination operand, the high 16-byte blocks of the two source operands
produce the high 16-byte result of the destination operand, and so on for the blocks in the middle.
VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register and contains two 16-byte
blocks. The second source operand is a YMM register or a 256-bit memory location containing two 16-byte blocks.
The destination operand is a YMM register and contains two 16-byte results. The imm8[7:0] is the common shift
count used for the two lower 16-byte block sources and the two upper 16-byte block sources. The low 16-byte
blocks of the two source operands produce the low 16-byte result of the destination operand, and the high 16-byte
blocks of the two source operands produce the high 16-byte result of the destination operand. The upper bits
(MAXVL-1:256) of the corresponding ZMM destination register are zeroed.
VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand
is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-
1:128) of the corresponding ZMM register destination are zeroed.
Concatenation is done with 128-bit data in the first and second source operands for both 128-bit and 256-bit
instructions. The high 128 bits of the intermediate composite 256-bit result come from the 128-bit data of the
first source operand; the low 128 bits of the intermediate result come from the 128-bit data of the second source
operand.
Figure: 256-bit VPALIGNR. Within each 128-bit lane, the composite SRC1:SRC2 is shifted right by Imm8[7:0]*8 bits and the right-aligned result is written to the corresponding lane of DEST.
Operation
PALIGNR (with 64-bit operands)
temp1[127:0] = CONCATENATE(DEST,SRC)>>(imm8*8)
DEST[63:0] = temp1[63:0]
FOR j ← 0 TO KL-1
i ← j*8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] ← TMP_DEST[i+7:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] ← 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0
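As a cross-check of the 64-bit form of the operation, the shift-and-extract can be modeled in portable C (a sketch of the semantics only, not of the instruction encoding):

```c
#include <stdint.h>

/* Model of PALIGNR with 64-bit operands: the 128-bit composite dest:src
 * is shifted right by imm8*8 bits and the low 64 bits are kept.
 * Shift counts of 16 or more (i.e., larger than 2L) produce zero. */
static uint64_t palignr64(uint64_t dest, uint64_t src, unsigned imm8)
{
    if (imm8 >= 16)
        return 0;                 /* count exceeds the composite width */
    if (imm8 == 0)
        return src;               /* no shift: low half of dest:src */
    if (imm8 < 8)
        return (src >> (imm8 * 8)) | (dest << ((8 - imm8) * 8));
    return dest >> ((imm8 - 8) * 8);   /* src shifted out entirely */
}
```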
Other Exceptions
Non-EVEX-encoded instruction, see Exceptions Type 4.
EVEX-encoded instruction, see Exceptions Type E4NF.nb.
Description
Performs a carry-less multiplication of two quadwords, selected from the first source and second source operands
according to the value of the immediate byte. Bits 4 and 0 are used to select which 64-bit half of each operand to
use according to Table 4-13; the other bits of the immediate byte are ignored.
The EVEX encoded form of this instruction does not support memory fault suppression.
The first source operand and the destination operand are the same and must be a ZMM/YMM/XMM register. The
second source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. Bits (VL_MAX-
1:128) of the corresponding YMM destination register remain unchanged.
Compilers and assemblers may implement the following pseudo-op syntax to simplify programming and emit the
required encoding for imm8.
Operation
define PCLMUL128(X,Y): // helper function
FOR i ← 0 to 63:
TMP [ i ] ← X[ 0 ] and Y[ i ]
FOR j ← 1 to i:
TMP [ i ] ← TMP [ i ] xor (X[ j ] and Y[ i - j ])
DEST[ i ] ← TMP[ i ]
FOR i ← 64 to 126:
TMP [ i ] ← 0
FOR j ← i - 63 to 63:
TMP [ i ] ← TMP [ i ] xor (X[ j ] and Y[ i - j ])
DEST[ i ] ← TMP[ i ]
DEST[127] ← 0;
RETURN DEST // 128b vector
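The helper function above can be mirrored in portable C by scanning the bits of one operand and XOR-accumulating shifted copies of the other (a sketch of the math, returning the 128-bit product as two 64-bit halves):

```c
#include <stdint.h>

/* Carry-less (GF(2)[x]) multiplication of two 64-bit operands,
 * equivalent to the PCLMUL128 helper: the 128-bit product is returned
 * as *hi (bits 127:64) and *lo (bits 63:0); bit 127 is always 0. */
static void clmul64(uint64_t x, uint64_t y, uint64_t *hi, uint64_t *lo)
{
    uint64_t h = 0, l = 0;
    for (int i = 0; i < 64; i++) {
        if ((y >> i) & 1u) {
            l ^= x << i;                 /* low part of x * 2^i */
            if (i != 0)
                h ^= x >> (64 - i);      /* bits carried past bit 63 */
        }
    }
    *hi = h;
    *lo = l;
}
```

Carry-less multiplication is ordinary polynomial multiplication over GF(2): for example, (x+1)*(x+1) = x^2+1, so multiplying 3 by 3 yields 5 rather than 9.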
Other Exceptions
See Exceptions Type 4, additionally
#UD If VEX.L = 1.
EVEX-encoded: See Exceptions Type E4NF.
VEX.128.66.0F38.WIG 03 /r VPHADDSW xmm1, xmm2, xmm3/m128 (RVM, V/V, AVX): Add 16-bit signed integers horizontally, pack saturated integers to xmm1.
VEX.256.66.0F38.WIG 03 /r VPHADDSW ymm1, ymm2, ymm3/m256 (RVM, V/V, AVX2): Add 16-bit signed integers horizontally, pack saturated integers to ymm1.
NOTES:
1. See note in Section 2.4, “AVX and SSE Instruction Exception Specification” in the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2A and Section 22.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers”
in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
Description
(V)PHADDSW adds two adjacent signed 16-bit integers horizontally from the source and destination operands and
saturates the signed results, then packs the signed, saturated 16-bit results to the destination operand (first operand).
When the source operand is a 128-bit memory operand, the operand must be aligned on a 16-byte boundary or a
general-protection exception (#GP) will be generated.
Legacy SSE version: Both operands can be MMX registers. The second source operand can be an MMX register or a
64-bit memory location.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destina-
tion register remain unchanged.
In 64-bit mode, use the REX prefix to access additional registers.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the destination YMM register are
zeroed.
VEX.256 encoded version: The first source and destination operands are YMM registers. The second source
operand can be an YMM register or a 256-bit memory location.
Operation
PHADDSW (with 64-bit operands)
mm1[15-0] = SaturateToSignedWord(mm1[31-16] + mm1[15-0]);
mm1[31-16] = SaturateToSignedWord(mm1[63-48] + mm1[47-32]);
mm1[47-32] = SaturateToSignedWord(mm2/m64[31-16] + mm2/m64[15-0]);
mm1[63-48] = SaturateToSignedWord(mm2/m64[63-48] + mm2/m64[47-32]);
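The 64-bit operation above can be modeled in portable C; the saturation helper mirrors SaturateToSignedWord (a sketch, with arrays standing in for the MMX registers):

```c
#include <stdint.h>

/* Clamp a 32-bit sum to the signed 16-bit range, as SaturateToSignedWord. */
static int16_t sat16(int32_t v)
{
    if (v > 32767)  return 32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* PHADDSW with 64-bit operands: each result word is the saturated sum of
 * an adjacent word pair, pairs taken first from dst, then from src. */
static void phaddsw64(int16_t dst[4], const int16_t src[4])
{
    int16_t r0 = sat16((int32_t)dst[1] + dst[0]);
    int16_t r1 = sat16((int32_t)dst[3] + dst[2]);
    int16_t r2 = sat16((int32_t)src[1] + src[0]);
    int16_t r3 = sat16((int32_t)src[3] + src[2]);
    dst[0] = r0; dst[1] = r1; dst[2] = r2; dst[3] = r3;
}
```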
Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
VEX.256.66.0F38.WIG 01 /r VPHADDW ymm1, ymm2, ymm3/m256 (RVM, V/V, AVX2): Add 16-bit signed integers horizontally, pack to ymm1.
VEX.256.66.0F38.WIG 02 /r VPHADDD ymm1, ymm2, ymm3/m256 (RVM, V/V, AVX2): Add 32-bit signed integers horizontally, pack to ymm1.
NOTES:
1. See note in Section 2.4, “AVX and SSE Instruction Exception Specification” in the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2A and Section 22.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers”
in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
Description
(V)PHADDW adds two adjacent 16-bit signed integers horizontally from the source and destination operands and
packs the 16-bit signed results to the destination operand (first operand). (V)PHADDD adds two adjacent 32-bit
signed integers horizontally from the source and destination operands and packs the 32-bit signed results to the
destination operand (first operand). When the source operand is a 128-bit memory operand, the operand must be
aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated.
Note that these instructions can operate on either unsigned or signed (two’s complement notation) integers;
however, they do not set bits in the EFLAGS register to indicate overflow or a carry. To prevent undetected
overflow conditions, software must control the ranges of the values operated on.
Legacy SSE instructions: Both operands can be MMX registers. The second source operand can be an MMX register
or a 64-bit memory location.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand can be an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM
destination register remain unchanged.
In 64-bit mode, use the REX prefix to access additional registers.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand can be an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM
register are zeroed.
VEX.256 encoded version: Horizontal addition of two adjacent data elements of the low 16-bytes of the first and
second source operands are packed into the low 16-bytes of the destination operand. Horizontal addition of two
adjacent data elements of the high 16-bytes of the first and second source operands are packed into the high 16-
bytes of the destination operand. The first source and destination operands are YMM registers. The second source
operand can be an YMM register or a 256-bit memory location.
Figure: 256-bit VPHADDW horizontal addition. Adjacent word pairs of SRC1 (X7..X0) and SRC2 (Y7..Y0) are summed within each 128-bit lane and packed into DEST (S7..S0).
Operation
PHADDW (with 64-bit operands)
mm1[15-0] = mm1[31-16] + mm1[15-0];
mm1[31-16] = mm1[63-48] + mm1[47-32];
mm1[47-32] = mm2/m64[31-16] + mm2/m64[15-0];
mm1[63-48] = mm2/m64[63-48] + mm2/m64[47-32];
Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
NOTES:
1. See note in Section 2.4, “AVX and SSE Instruction Exception Specification” in the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2A and Section 22.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers”
in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
Description
(V)PHSUBSW performs horizontal subtraction on each adjacent pair of 16-bit signed integers by subtracting the
most significant word from the least significant word of each pair in the source and destination operands. The
signed, saturated 16-bit results are packed to the destination operand (first operand). When the source operand is
a 128-bit memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception
(#GP) will be generated.
Legacy SSE version: Both operands can be MMX registers. The second source operand can be an MMX register or
a 64-bit memory location.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destina-
tion register remain unchanged.
In 64-bit mode, use the REX prefix to access additional registers.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the destination YMM register are
zeroed.
VEX.256 encoded version: The first source and destination operands are YMM registers. The second source
operand can be an YMM register or a 256-bit memory location.
Operation
PHSUBSW (with 64-bit operands)
mm1[15-0] = SaturateToSignedWord(mm1[15-0] - mm1[31-16]);
mm1[31-16] = SaturateToSignedWord(mm1[47-32] - mm1[63-48]);
mm1[47-32] = SaturateToSignedWord(mm2/m64[15-0] - mm2/m64[31-16]);
mm1[63-48] = SaturateToSignedWord(mm2/m64[47-32] - mm2/m64[63-48]);
Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
NOTES:
1. See note in Section 2.4, “AVX and SSE Instruction Exception Specification” in the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2A and Section 22.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers”
in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
Description
(V)PHSUBW performs horizontal subtraction on each adjacent pair of 16-bit signed integers by subtracting the
most significant word from the least significant word of each pair in the source and destination operands, and packs
the signed 16-bit results to the destination operand (first operand). (V)PHSUBD performs horizontal subtraction on
each adjacent pair of 32-bit signed integers by subtracting the most significant doubleword from the least signifi-
cant doubleword of each pair, and packs the signed 32-bit result to the destination operand. When the source
operand is a 128-bit memory operand, the operand must be aligned on a 16-byte boundary or a general-protection
exception (#GP) will be generated.
Legacy SSE version: Both operands can be MMX registers. The second source operand can be an MMX register or a
64-bit memory location.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destina-
tion register remain unchanged.
In 64-bit mode, use the REX prefix to access additional registers.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the destination YMM register are
zeroed.
VEX.256 encoded version: The first source and destination operands are YMM registers. The second source
operand can be an YMM register or a 256-bit memory location.
Operation
PHSUBW (with 64-bit operands)
mm1[15-0] = mm1[15-0] - mm1[31-16];
mm1[31-16] = mm1[47-32] - mm1[63-48];
mm1[47-32] = mm2/m64[15-0] - mm2/m64[31-16];
mm1[63-48] = mm2/m64[47-32] - mm2/m64[63-48];
Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
0F A9 POP GS (ZO, Valid, Valid): Pop top of stack into GS; increment stack pointer by 16 bits.
0F A9 POP GS (ZO, N.E., Valid): Pop top of stack into GS; increment stack pointer by 32 bits.
0F A9 POP GS (ZO, Valid, N.E.): Pop top of stack into GS; increment stack pointer by 64 bits.
Description
Loads the value from the top of the stack to the location specified with the destination operand (or explicit opcode)
and then increments the stack pointer. The destination operand can be a general-purpose register, memory loca-
tion, or segment register.
Address and operand sizes are determined and used as follows:
• Address size. The D flag in the current code-segment descriptor determines the default address size; it may be
overridden by an instruction prefix (67H).
The address size is used only when writing to a destination operand in memory.
• Operand size. The D flag in the current code-segment descriptor determines the default operand size; it may
be overridden by instruction prefixes (66H or REX.W).
The operand size (16, 32, or 64 bits) determines the amount by which the stack pointer is incremented (2, 4
or 8).
• Stack-address size. Outside of 64-bit mode, the B flag in the current stack-segment descriptor determines the
size of the stack pointer (16 or 32 bits); in 64-bit mode, the size of the stack pointer is always 64 bits.
The stack-address size determines the width of the stack pointer when reading from the stack in memory and
when incrementing the stack pointer. (As stated above, the amount by which the stack pointer is incremented
is determined by the operand size.)
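The relationship between operand size and stack-pointer increment described above can be sketched with a toy model (the names and the flat byte-array stack are illustrative, not architectural; a little-endian host is assumed):

```c
#include <stdint.h>
#include <string.h>

/* Toy model of POP: read op_size bytes (2, 4, or 8) from a byte-array
 * stack at *sp, then increment *sp by the operand size. */
static uint64_t pop_model(const uint8_t *stack, uint64_t *sp, int op_size)
{
    uint64_t v = 0;
    memcpy(&v, stack + *sp, (size_t)op_size);  /* little-endian host */
    *sp += (uint64_t)op_size;                  /* increment by 2, 4, or 8 */
    return v;
}
```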
If the destination operand is one of the segment registers DS, ES, FS, GS, or SS, the value loaded into the register
must be a valid segment selector. In protected mode, popping a segment selector into a segment register automat-
ically causes the descriptor information associated with that segment selector to be loaded into the hidden
(shadow) part of the segment register and causes the selector and the descriptor information to be validated (see
the “Operation” section below).
A NULL value (0000-0003) may be popped into the DS, ES, FS, or GS register without causing a general protection
fault. However, any subsequent attempt to reference a segment whose corresponding segment register is loaded
with a NULL value causes a general protection exception (#GP). In this situation, no memory reference occurs and
the saved value of the segment register is NULL.
The POP instruction cannot pop a value into the CS register. To load the CS register from the stack, use the RET
instruction.
If the ESP register is used as a base register for addressing a destination operand in memory, the POP instruction
computes the effective address of the operand after it increments the ESP register. For the case of a 16-bit stack
where ESP wraps to 0H as a result of the POP instruction, the resulting location of the memory write is processor-
family-specific.
The POP ESP instruction increments the stack pointer (ESP) before data at the old top of stack is written into the
destination.
Loading the SS register with a POP instruction suppresses or inhibits some debug exceptions and inhibits interrupts
on the following instruction boundary. (The inhibition ends after delivery of an exception or the execution of the
next instruction.) This behavior allows a stack pointer to be loaded into the ESP register with the next instruction
(POP ESP) before an event can be delivered. See Section 6.8.3, “Masking Exceptions and Interrupts When
Switching Stacks,” in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A. Intel recom-
mends that software use the LSS instruction to load the SS register and ESP together.
In 64-bit mode, using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). When in
64-bit mode, POPs using 32-bit operands are not encodable and POPs to DS, ES, SS are not valid. See the
summary chart at the beginning of this section for encoding data and limits.
Operation
IF StackAddrSize = 32
THEN
IF OperandSize = 32
THEN
DEST ← SS:ESP; (* Copy a doubleword *)
ESP ← ESP + 4;
ELSE (* OperandSize = 16*)
DEST ← SS:ESP; (* Copy a word *)
ESP ← ESP + 2;
FI;
ELSE IF StackAddrSize = 64
THEN
IF OperandSize = 64
THEN
DEST ← SS:RSP; (* Copy a quadword *)
RSP ← RSP + 8;
ELSE (* OperandSize = 16 *)
DEST ← SS:RSP; (* Copy a word *)
RSP ← RSP + 2;
FI;
FI;
Loading a segment register while in protected mode results in special actions, as described in the following listing.
These checks are performed on the segment selector and the segment descriptor it points to.
64-BIT_MODE
IF FS or GS is loaded with a non-NULL selector;
THEN
IF segment selector index is outside descriptor table limits
OR segment is not a data or readable code segment
OR ((segment is a data or nonconforming code segment)
AND ((RPL > DPL) OR (CPL > DPL)))
THEN #GP(selector);
IF segment not marked present
THEN #NP(selector);
ELSE
SegmentRegister ← segment selector;
SegmentRegister ← segment descriptor;
FI;
FI;
IF FS or GS is loaded with a NULL selector;
THEN
SegmentRegister ← segment selector;
SegmentRegister ← segment descriptor;
FI;
IF SS is loaded;
THEN
IF segment selector is NULL
THEN #GP(0);
FI;
IF segment selector index is outside descriptor table limits
or segment selector's RPL ≠ CPL
Flags Affected
None.
#SS(0) If the current top of stack is not within the stack segment.
If a memory operand effective address is outside the SS segment limit.
#SS(selector) If the SS register is being loaded and the segment pointed to is marked not present.
#NP If the DS, ES, FS, or GS register is being loaded and the segment pointed to is marked not
present.
#PF(fault-code) If a page fault occurs.
#AC(0) If an unaligned memory reference is made while the current privilege level is 3 and alignment
checking is enabled.
#UD If the LOCK prefix is used.
F3 REX.W 0F 1E /1 (mod=11) RDSSPQ r64 (R, V/N.E., CET_SS): Copies shadow stack pointer (SSP) to r64.
Description
Copies the current shadow stack pointer (SSP) register to the register destination. This opcode is a NOP when CET
shadow stacks are not enabled and on processors that do not support CET.
Operation
IF CPL = 3
IF CR4.CET & IA32_U_CET.SH_STK_EN
IF (operand size is 64 bit)
THEN
Dest ← SSP;
ELSE
Dest ← SSP[31:0];
FI;
FI;
ELSE
IF CR4.CET & IA32_S_CET.SH_STK_EN
IF (operand size is 64 bit)
THEN
Dest ← SSP;
ELSE
Dest ← SSP[31:0];
FI;
FI;
FI;
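The operand-size behavior above, when shadow stacks are enabled, can be condensed into a small C sketch (when CET is not enabled the real instruction is a NOP and the destination is left unchanged; the function name is illustrative):

```c
#include <stdint.h>

/* RDSSP operand-size behavior with shadow stacks enabled:
 * 64-bit operand size returns the full SSP, 32-bit returns SSP[31:0]. */
static uint64_t rdssp_model(uint64_t ssp, int op_size_is_64)
{
    return op_size_is_64 ? ssp : (ssp & 0xFFFFFFFFu);
}
```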
Flags Affected
None.
Description
Transfers program control to a return address located on the top of the stack. The address is usually placed on the
stack by a CALL instruction, and the return is made to the instruction that follows the CALL instruction.
The optional source operand specifies the number of stack bytes to be released after the return address is popped;
the default is none. This operand can be used to release parameters from the stack that were passed to the called
procedure and are no longer needed. It must be used when the CALL instruction used to switch to a new procedure
uses a call gate with a non-zero word count to access the new procedure. Here, the source operand for the RET
instruction must specify the same number of bytes as is specified in the word count field of the call gate.
The RET instruction can be used to execute three different types of returns:
• Near return — A return to a calling procedure within the current code segment (the segment currently pointed
to by the CS register), sometimes referred to as an intrasegment return.
• Far return — A return to a calling procedure located in a different segment than the current code segment,
sometimes referred to as an intersegment return.
• Inter-privilege-level far return — A far return to a different privilege level than that of the currently
executing program or procedure.
The inter-privilege-level return type can only be executed in protected mode. See the section titled “Calling Proce-
dures Using Call and RET” in Chapter 6 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 1, for detailed information on near, far, and inter-privilege-level returns.
When executing a near return, the processor pops the return instruction pointer (offset) from the top of the stack
into the EIP register and begins program execution at the new instruction pointer. The CS register is unchanged.
When executing a far return, the processor pops the return instruction pointer from the top of the stack into the EIP
register, then pops the segment selector from the top of the stack into the CS register. The processor then begins
program execution in the new code segment at the new instruction pointer.
The mechanics of an inter-privilege-level far return are similar to an intersegment return, except that the
processor examines the privilege levels and access rights of the code and stack segments being returned to deter-
mine if the control transfer is allowed to be made. The DS, ES, FS, and GS segment registers are cleared by the RET
instruction during an inter-privilege-level return if they refer to segments that are not allowed to be accessed at the
new privilege level. Since a stack switch also occurs on an inter-privilege level return, the ESP and SS registers are
loaded from the stack.
If parameters are passed to the called procedure during an inter-privilege level call, the optional source operand
must be used with the RET instruction to release the parameters on the return. Here, the parameters are released
both from the called procedure’s stack and the calling procedure’s stack (that is, the stack being returned to).
In 64-bit mode, the default operation size of this instruction is the stack-address size, i.e. 64 bits. This applies to
near returns, not far returns; the default operation size of far returns is 32 bits.
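A near return with an immediate operand can be sketched with a toy stack model: the return address is popped, then the immediate releases that many additional bytes of parameters (all names are illustrative, not architectural):

```c
#include <stdint.h>

/* Toy model of near RET imm16 with a 32-bit stack: pop the 4-byte
 * return address, then release imm16 further bytes of parameters. */
static uint32_t near_ret_model(const uint32_t *stack, uint32_t *esp,
                               uint16_t imm16)
{
    uint32_t eip = stack[*esp / 4];    /* pop return address */
    *esp += 4u + imm16;                /* discard address + parameters */
    return eip;
}
```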
Refer to Chapter 6, “Procedure Calls, Interrupts, and Exceptions” and Chapter 18, “Control-Flow Enforcement Tech-
nology (CET)” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 for CET details.
Instruction ordering. Instructions following a far return may be fetched from memory before earlier instructions
complete execution, but they will not execute (even speculatively) until all instructions prior to the far return have
completed execution (the later instructions may execute before data stored by the earlier instructions have become
globally visible).
Unlike near indirect CALL and near indirect JMP, the processor will not speculatively execute the next sequential
instruction after a near RET unless that instruction is also the target of a jump or is a target in a branch predictor.
Operation
(* Near return *)
IF instruction = near return
THEN;
IF OperandSize = 32
THEN
IF top 4 bytes of stack not within stack limits
THEN #SS(0); FI;
EIP ← Pop();
IF ShadowStackEnabled(CPL)
tempSsEIP = ShadowStackPop4B();
IF EIP != tempSsEIP
THEN #CP(NEAR_RET); FI;
FI;
ELSE
IF OperandSize = 64
THEN
IF top 8 bytes of stack not within stack limits
THEN #SS(0); FI;
RIP ← Pop();
IF ShadowStackEnabled(CPL)
tempSsEIP = ShadowStackPop8B();
IF RIP != tempSsEIP
THEN #CP(NEAR_RET); FI;
FI;
ELSE (* OperandSize = 16 *)
IF top 2 bytes of stack not within stack limits
THEN #SS(0); FI;
tempEIP ← Pop();
tempEIP ← tempEIP AND 0000FFFFH;
IF tempEIP not within code segment limits
THEN #GP(0); FI;
EIP ← tempEIP;
IF ShadowStackEnabled(CPL)
tempSsEIP = ShadowStackPop4B();
IF EIP != tempSsEIP
THEN #CP(NEAR_RET); FI;
FI;
FI;
FI;
RETURN-TO-SAME-PRIVILEGE-LEVEL:
IF the return instruction pointer is not within the return code segment limit
THEN #GP(0); FI;
IF OperandSize = 32
THEN
EIP ← Pop();
CS ← Pop(); (* 32-bit pop, high-order 16 bits discarded *)
ELSE (* OperandSize = 16 *)
EIP ← Pop();
EIP ← EIP AND 0000FFFFH;
CS ← Pop(); (* 16-bit pop *)
FI;
IF instruction has immediate operand
THEN (* Release parameters from stack *)
IF StackAddressSize = 32
THEN
ESP ← ESP + SRC;
ELSE (* StackAddressSize = 16 *)
SP ← SP + SRC;
FI;
FI;
IF ShadowStackEnabled(CPL)
(* SSP must be 8 byte aligned *)
IF SSP AND 0x7 != 0
THEN #CP(FAR-RET/IRET); FI;
tempSsCS = shadow_stack_load 8 bytes from SSP+16;
tempSsLIP = shadow_stack_load 8 bytes from SSP+8;
prevSSP = shadow_stack_load 8 bytes from SSP;
SSP = SSP + 24;
(* do a 64 bit-compare to check if any bits beyond bit 15 are set *)
tempCS = CS; (* zero pad to 64 bit *)
IF tempCS != tempSsCS
THEN #CP(FAR-RET/IRET); FI;
(* do a 64 bit-compare; pad CSBASE+RIP with 0 for 32 bit LIP*)
IF CSBASE + RIP != tempSsLIP
THEN #CP(FAR-RET/IRET); FI;
(* prevSSP must be 4 byte aligned *)
RETURN-TO-OUTER-PRIVILEGE-LEVEL:
IF top (16 + SRC) bytes of stack are not within stack limits (OperandSize = 32)
or top (8 + SRC) bytes of stack are not within stack limits (OperandSize = 16)
THEN #SS(0); FI;
Read return segment selector;
IF stack segment selector is NULL
THEN #GP(0); FI;
IF return stack segment selector index is not within its descriptor table limits
THEN #GP(selector); FI;
Read segment descriptor pointed to by return segment selector;
IF stack segment selector RPL ≠ RPL of the return code segment selector
or stack segment is not a writable data segment
or stack segment descriptor DPL ≠ RPL of the return code segment selector
THEN #GP(selector); FI;
IF stack segment not present
THEN #SS(StackSegmentSelector); FI;
IF the return instruction pointer is not within the return code segment limit
THEN #GP(0); FI;
CPL ← ReturnCodeSegmentSelector(RPL);
IF OperandSize = 32
THEN
EIP ← Pop();
CS ← Pop(); (* 32-bit pop, high-order 16 bits discarded; segment descriptor loaded *)
CS(RPL) ← CPL;
IF instruction has immediate operand
THEN (* Release parameters from called procedure’s stack *)
IF StackAddressSize = 32
THEN
ESP ← ESP + SRC;
ELSE (* StackAddressSize = 16 *)
SP ← SP + SRC;
FI;
FI;
tempESP ← Pop();
tempSS ← Pop(); (* 32-bit pop, high-order 16 bits discarded; seg. descriptor loaded *)
ESP ← tempESP;
SS ← tempSS;
ELSE (* OperandSize = 16 *)
EIP ← Pop();
EIP ← EIP AND 0000FFFFH;
CS ← Pop(); (* 16-bit pop; segment descriptor loaded *)
CS(RPL) ← CPL;
IF instruction has immediate operand
THEN (* Release parameters from called procedure’s stack *)
IF StackAddressSize = 32
THEN
CPL ← ReturnCodeSegmentSelector(RPL);
ESP ← tempESP;
SS ← tempSS;
tempOldSSP = SSP;
IF ShadowStackEnabled(CPL)
IF CPL = 3
THEN tempSSP ← IA32_PL3_SSP; FI;
IF ((EFER.LMA AND CS.L) = 0 AND tempSSP[63:32] != 0)
THEN #GP(0); FI;
SSP ← tempSSP
FI;
(* Now past all faulting points; safe to free the token. The token free is done using the old SSP
* and using a supervisor override as old CPL was a supervisor privilege level *)
IF ShadowStackEnabled(tempOldCPL)
expected_token_value = tempOldSSP | BUSY_BIT (* busy bit - bit position 0 - must be set *)
new_token_value = tempOldSSP (* clear the busy bit *)
shadow_stack_lock_cmpxchg8b(tempOldSSP, new_token_value, expected_token_value)
FI;
FI;
(* IA-32e Mode *)
IF (PE = 1 and VM = 0 and IA32_EFER.LMA = 1) and instruction = far return
THEN
IF OperandSize = 32
THEN
IF second doubleword on stack is not within stack limits
THEN #SS(0); FI;
IF first or second doubleword on stack is not in canonical space
THEN #SS(0); FI;
ELSE
IF OperandSize = 16
THEN
IF second word on stack is not within stack limits
THEN #SS(0); FI;
IF first or second word on stack is not in canonical space
THEN #SS(0); FI;
ELSE (* OperandSize = 64 *)
IF first or second quadword on stack is not in canonical space
THEN #SS(0); FI;
FI
FI;
IF return code segment selector is NULL
THEN #GP(0); FI;
IF return code segment selector addresses descriptor beyond descriptor table limit
THEN #GP(selector); FI;
IF return code segment selector addresses descriptor in non-canonical space
THEN #GP(selector); FI;
Obtain descriptor to which return code segment selector points from descriptor table;
IF return code segment descriptor is not a code segment
THEN #GP(selector); FI;
IF return code segment descriptor has L-bit = 1 and D-bit = 1
THEN #GP(selector); FI;
IF return code segment selector RPL < CPL
THEN #GP(selector); FI;
IF return code segment descriptor is conforming
and return code segment DPL > return code segment selector RPL
THEN #GP(selector); FI;
IA-32E-MODE-RETURN-TO-SAME-PRIVILEGE-LEVEL:
IF the return instruction pointer is not within the return code segment limit
THEN #GP(0); FI;
IF the return instruction pointer is not within canonical address space
THEN #GP(0); FI;
IF OperandSize = 32
THEN
EIP ← Pop();
CS ← Pop(); (* 32-bit pop, high-order 16 bits discarded *)
ELSE
IF OperandSize = 16
THEN
EIP ← Pop();
EIP ← EIP AND 0000FFFFH;
CS ← Pop(); (* 16-bit pop *)
ELSE (* OperandSize = 64 *)
RIP ← Pop();
CS ← Pop(); (* 64-bit pop, high-order 48 bits discarded *)
FI;
FI;
IF instruction has immediate operand
THEN (* Release parameters from stack *)
IF StackAddressSize = 32
THEN
ESP ← ESP + SRC;
ELSE
IF StackAddressSize = 16
THEN
SP ← SP + SRC;
ELSE (* StackAddressSize = 64 *)
RSP ← RSP + SRC;
FI;
FI;
FI;
IF ShadowStackEnabled(CPL)
IF SSP AND 0x7 != 0 (* check if aligned to 8 bytes *)
THEN #CP(FAR-RET/IRET); FI;
tempSsCS = shadow_stack_load 8 bytes from SSP+16;
tempSsLIP = shadow_stack_load 8 bytes from SSP+8;
tempSSP = shadow_stack_load 8 bytes from SSP;
SSP = SSP + 24;
tempCS = CS; (* zero padded to 64 bit *)
IF tempCS != tempSsCS (* 64 bit compare; CS zero padded to 64 bits *)
IA-32E-MODE-RETURN-TO-OUTER-PRIVILEGE-LEVEL:
IF top (16 + SRC) bytes of stack are not within stack limits (OperandSize = 32)
or top (8 + SRC) bytes of stack are not within stack limits (OperandSize = 16)
THEN #SS(0); FI;
IF top (16 + SRC) bytes of stack are not in canonical address space (OperandSize = 32)
or top (8 + SRC) bytes of stack are not in canonical address space (OperandSize = 16)
or top (32 + SRC) bytes of stack are not in canonical address space (OperandSize = 64)
THEN #SS(0); FI;
Read return stack segment selector;
IF stack segment selector is NULL
THEN
IF new CS descriptor L-bit = 0
THEN #GP(selector);
IF stack segment selector RPL = 3
THEN #GP(selector);
FI;
IF return stack segment descriptor is not within descriptor table limits
THEN #GP(selector); FI;
IF return stack segment descriptor is in non-canonical address space
THEN #GP(selector); FI;
Read segment descriptor pointed to by return segment selector;
IF stack segment selector RPL ≠ RPL of the return code segment selector
or stack segment is not a writable data segment
or stack segment descriptor DPL ≠ RPL of the return code segment selector
THEN #GP(selector); FI;
IF stack segment not present
THEN #SS(StackSegmentSelector); FI;
IF the return instruction pointer is not within the return code segment limit
THEN #GP(0); FI;
IF the return instruction pointer is not within canonical address space
THEN #GP(0); FI;
CPL ← ReturnCodeSegmentSelector(RPL);
IF OperandSize = 32
THEN
EIP ← Pop();
CS ← Pop(); (* 32-bit pop, high-order 16 bits discarded, segment descriptor loaded *)
CS(RPL) ← CPL;
IF instruction has immediate operand
THEN (* Release parameters from called procedure’s stack *)
IF StackAddressSize = 32
THEN
ESP ← ESP + SRC;
ELSE
IF StackAddressSize = 16
THEN
SP ← SP + SRC;
ELSE (* StackAddressSize = 64 *)
RSP ← RSP + SRC;
FI;
FI;
FI;
tempESP ← Pop();
tempSS ← Pop(); (* 32-bit pop, high-order 16 bits discarded, segment descriptor loaded *)
ESP ← tempESP;
SS ← tempSS;
ELSE
IF OperandSize = 16
THEN
EIP ← Pop();
EIP ← EIP AND 0000FFFFH;
CS ← Pop(); (* 16-bit pop; segment descriptor loaded *)
CS(RPL) ← CPL;
IF instruction has immediate operand
THEN (* Release parameters from called procedure’s stack *)
IF StackAddressSize = 32
THEN
ESP ← ESP + SRC;
ELSE
IF StackAddressSize = 16
THEN
SP ← SP + SRC;
ELSE (* StackAddressSize = 64 *)
RSP ← RSP + SRC;
FI;
FI;
FI;
tempESP ← Pop();
tempSS ← Pop(); (* 16-bit pop; segment descriptor loaded *)
ESP ← tempESP;
SS ← tempSS;
ELSE (* OperandSize = 64 *)
RIP ← Pop();
CS ← Pop(); (* 64-bit pop; high-order 48 bits discarded; seg. descriptor loaded *)
CS(RPL) ← CPL;
IF instruction has immediate operand
THEN (* Release parameters from called procedure’s stack *)
RSP ← RSP + SRC;
FI;
tempESP ← Pop();
tempSS ← Pop(); (* 64-bit pop; high-order 48 bits discarded; seg. desc. loaded *)
ESP ← tempESP;
SS ← tempSS;
FI;
FI;
IF ShadowStackEnabled(CPL)
(* check if 8 byte aligned *)
IF SSP AND 0x7 != 0
THEN #CP(FAR-RET/IRET); FI;
IF ReturnCodeSegmentSelector(RPL) != 3
THEN
tempSsCS = shadow_stack_load 8 bytes from SSP+16;
tempSsLIP = shadow_stack_load 8 bytes from SSP+8;
tempSSP = shadow_stack_load 8 bytes from SSP;
SSP = SSP + 24;
(* Do 64 bit compare to detect bits beyond 15 being set *)
tempCS = CS; (* zero padded to 64 bit *)
IF tempCS != tempSsCS
THEN #CP(FAR-RET/IRET); FI;
(* Do 64 bit compare; pad CSBASE+RIP with 0 for 32 bit LIP *)
IF CSBASE + RIP != tempSsLIP
THEN #CP(FAR-RET/IRET); FI;
(* check if 4 byte aligned *)
IF tempSSP AND 0x3 != 0
THEN #CP(FAR-RET/IRET); FI;
FI;
FI;
tempOldCPL = CPL;
CPL ← ReturnCodeSegmentSelector(RPL);
ESP ← tempESP;
SS ← tempSS;
tempOldSSP = SSP;
IF ShadowStackEnabled(CPL)
IF CPL = 3
THEN tempSSP ← IA32_PL3_SSP; FI;
IF ((EFER.LMA AND CS.L) = 0 AND tempSSP[63:32] != 0)
THEN #GP(0); FI;
SSP ← tempSSP
FI;
(* Now past all faulting points; safe to free the token. The token free is done using the old SSP
* and using a supervisor override as old CPL was a supervisor privilege level *)
IF ShadowStackEnabled(tempOldCPL)
expected_token_value = tempOldSSP | BUSY_BIT (* busy bit - bit position 0 - must be set *)
new_token_value = tempOldSSP (* clear the busy bit *)
shadow_stack_lock_cmpxchg8b(tempOldSSP, new_token_value, expected_token_value)
FI;
IF instruction has immediate operand
THEN (* Release parameters from called procedure’s stack *)
IF StackAddressSize = 32
THEN
ESP ← ESP + SRC;
ELSE
IF StackAddressSize = 16
THEN
SP ← SP + SRC;
ELSE (* StackAddressSize = 64 *)
RSP ← RSP + SRC;
FI;
FI;
FI;
Flags Affected
None.
Description
Restores SSP from the shadow-stack-restore token pointed to by m64. If the SSP restore was successful then the
instruction replaces the shadow-stack-restore token with a previous-ssp token. The instruction sets the CF flag to
indicate whether the SSP address recorded in the shadow-stack-restore token that was processed was 4 byte
aligned, i.e., whether an alignment hole was created when the restore-shadow-stack token was pushed on this
shadow stack.
Following RSTORSSP if a restore-shadow-stack token needs to be saved on the previous shadow stack, use the
SAVEPREVSSP instruction.
If pushing a restore-shadow-stack token on the previous shadow stack is not required, the previous-ssp token can
be popped using the INCSSPQ instruction. If the CF flag was set to indicate presence of an alignment hole, an addi-
tional INCSSPD instruction is needed to advance the SSP past the alignment hole.
Operation
IF CPL = 3
IF (CR4.CET & IA32_U_CET.SH_STK_EN) = 0
THEN #UD; FI;
ELSE
IF (CR4.CET & IA32_S_CET.SH_STK_EN) = 0
THEN #UD; FI;
FI;
IF TMP != SSP_LA
THEN fault = 1; FI; (* If address in token does not match the requested top of stack *)
IF fault == 1
THEN #CP(RSTORSSP); FI;
SSP = SSP_LA
(* Set the CF if the SSP in the restore token was 4 byte aligned, i.e., there is an alignment hole *)
RFLAGS.CF = (restore_ssp_token & 0x04) ? 1 : 0;
RFLAGS.ZF,PF,AF,OF,SF ← 0;
Flags Affected
CF is set to indicate if the shadow stack pointer in the restore token was 4 byte aligned, else it is cleared. ZF,
PF, AF, OF, and SF are cleared.
Description
Push a restore-shadow-stack token on the previous shadow stack at the next 8 byte aligned boundary. The
previous SSP is obtained from the previous-ssp token at the top of the current shadow stack.
Operation
IF CPL = 3
IF (CR4.CET & IA32_U_CET.SH_STK_EN) = 0
THEN #UD; FI;
ELSE
IF (CR4.CET & IA32_S_CET.SH_STK_EN) = 0
THEN #UD; FI;
FI;
(* If the CF flag indicates there was an alignment hole on the current shadow stack, then pop that alignment hole *)
(* Note that the alignment hole must be zero and can be present only when in legacy/compatibility mode *)
IF RFLAGS.CF == 1 AND (EFER.LMA AND CS.L)
THEN #GP(0); FI;
IF RFLAGS.CF == 1
must_be_zero = ShadowStackPop4B(SSP)
IF must_be_zero != 0 THEN #GP(0); FI;
FI;
(* Save Prev SSP from previous_ssp_token to the old shadow stack at next 8 byte aligned address *)
old_SSP = previous_ssp_token & ~0x03
temp ← (old_SSP | (EFER.LMA & CS.L));
Shadow_stack_store 4 bytes of 0 to (old_SSP - 4)
old_SSP ← old_SSP & ~0x07;
Flags Affected
None.
Description
The SETSSBSY instruction verifies the presence of a non-busy supervisor shadow stack token at the address in the
IA32_PL0_SSP MSR and marks it busy. Following successful execution of the instruction, the SSP is set to the value
of the IA32_PL0_SSP MSR.
Operation
IF (CR4.CET = 0)
THEN #UD; FI;
IF (IA32_S_CET.SH_STK_EN = 0)
THEN #UD; FI;
IF CPL > 0
THEN #GP(0); FI;
SSP_LA = IA32_PL0_SSP
If SSP_LA not aligned to 8 bytes
THEN #GP(0); FI;
Flags Affected
None.
Description
The SHA256RNDS2 instruction performs 2 rounds of SHA256 operation using an initial SHA256 state (C,D,G,H)
from the first operand, an initial SHA256 state (A,B,E,F) from the second operand, and a pre-computed sum of the
next 2 round message dwords and the corresponding round constants from the implicit operand xmm0. Note that
only the two lower dwords of XMM0 are used by the instruction.
The updated SHA256 state (A,B,E,F) is written to the first operand, and the second operand can be used as the
updated state (C,D,G,H) in later rounds.
Operation
SHA256RNDS2
A_0 ← SRC2[127:96];
B_0 ← SRC2[95:64];
C_0 ← SRC1[127:96];
D_0 ← SRC1[95:64];
E_0 ← SRC2[63:32];
F_0 ← SRC2[31:0];
G_0 ← SRC1[63:32];
H_0 ← SRC1[31:0];
WK0 ← XMM0[31:0];
WK1 ← XMM0[63:32];
FOR i = 0 to 1
A_(i+1) ← Ch(E_i, F_i, G_i) + Σ1(E_i) + WK_i + H_i + Maj(A_i, B_i, C_i) + Σ0(A_i);
B_(i+1) ← A_i;
C_(i+1) ← B_i;
D_(i+1) ← C_i;
E_(i+1) ← Ch(E_i, F_i, G_i) + Σ1(E_i) + WK_i + H_i + D_i;
F_(i+1) ← E_i;
G_(i+1) ← F_i;
H_(i+1) ← G_i;
ENDFOR
DEST[127:96] ← A_2;
DEST[95:64] ← B_2;
DEST[63:32] ← E_2;
DEST[31:0] ← F_2;
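For illustration only, the two rounds above can be modeled in Python. The helper names below are ours, and the Ch, Maj, Σ0 and Σ1 definitions are the standard SHA-256 functions from FIPS 180-4; this is a sketch of the round structure, not a validated implementation.

```python
MASK32 = 0xFFFFFFFF

def rotr(x, n):
    # 32-bit rotate right
    return ((x >> n) | (x << (32 - n))) & MASK32

def ch(e, f, g):   return ((e & f) ^ (~e & g)) & MASK32
def maj(a, b, c):  return (a & b) ^ (a & c) ^ (b & c)
def sigma0(a):     return rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
def sigma1(e):     return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)

def sha256rnds2(src1, src2, wk):
    # src1 holds (C, D, G, H); src2 holds (A, B, E, F); wk = (WK0, WK1)
    a, b, e, f = src2
    c, d, g, h = src1
    for i in range(2):
        # Common sub-expression of the A and E updates in the pseudocode
        t = (ch(e, f, g) + sigma1(e) + wk[i] + h) & MASK32
        a, b, c, d, e, f, g, h = ((t + maj(a, b, c) + sigma0(a)) & MASK32,
                                  a, b, c,
                                  (t + d) & MASK32,
                                  e, f, g)
    return (a, b, e, f)  # updated (A, B, E, F), as written to DEST
```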
Flags Affected
None
Other Exceptions
See Exceptions Type 4.
Description
SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR
MSR (after saving the address of the instruction following SYSCALL into RCX). (The WRMSR instruction ensures
that the IA32_LSTAR MSR always contains a canonical address.)
SYSCALL also saves RFLAGS into R11 and then masks RFLAGS using the IA32_FMASK MSR (MSR address
C0000084H); specifically, the processor clears in RFLAGS every bit corresponding to a bit that is set in the
IA32_FMASK MSR.
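As a minimal sketch of the R11/RFLAGS handling only (not the full SYSCALL operation; the register values in the usage note are hypothetical):

```python
def syscall_flags(rflags, ia32_fmask):
    # SYSCALL saves RFLAGS into R11, then clears in RFLAGS every bit
    # that is set in the IA32_FMASK MSR.
    r11 = rflags
    new_rflags = rflags & ~ia32_fmask & 0xFFFFFFFFFFFFFFFF
    return r11, new_rflags
```

With RFLAGS = 0x246 and IA32_FMASK = 0x200 (the IF bit), the handler is entered with RFLAGS = 0x46, i.e., interrupts masked, while R11 preserves the caller's 0x246.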
SYSCALL loads the CS and SS selectors with values derived from bits 47:32 of the IA32_STAR MSR. However, the
CS and SS descriptor caches are not loaded from the descriptors (in GDT or LDT) referenced by those selectors.
Instead, the descriptor caches are loaded with fixed values. See the Operation section for details. It is the respon-
sibility of OS software to ensure that the descriptors (in GDT or LDT) referenced by those selector values corre-
spond to the fixed values loaded into the descriptor caches; the SYSCALL instruction does not ensure this
correspondence.
The SYSCALL instruction does not save the stack pointer (RSP). If the OS system-call handler will change the stack
pointer, it is the responsibility of software to save the previous value of the stack pointer. This might be done prior
to executing SYSCALL, with software restoring the stack pointer with the instruction following SYSCALL (which will
be executed after SYSRET). Alternatively, the OS system-call handler may save the stack pointer and restore it
before executing SYSRET.
When shadow stacks are enabled at a privilege level where the SYSCALL instruction is invoked, the SSP is saved to
the IA32_PL3_SSP MSR. If shadow stacks are enabled at privilege level 0, the SSP is loaded with 0. Refer to
Chapter 6, “Procedure Calls, Interrupts, and Exceptions” and Chapter 18, “Control-Flow Enforcement Technology
(CET)” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 for additional CET details.
Instruction ordering. Instructions following a SYSCALL may be fetched from memory before earlier instructions
complete execution, but they will not execute (even speculatively) until all instructions prior to the SYSCALL have
completed execution (the later instructions may execute before data stored by the earlier instructions have become
globally visible).
Operation
IF (CS.L ≠ 1 ) or (IA32_EFER.LMA ≠ 1) or (IA32_EFER.SCE ≠ 1)
(* Not in 64-Bit Mode or SYSCALL/SYSRET not enabled in IA32_EFER *)
THEN #UD;
FI;
CS.Selector ← IA32_STAR[47:32] AND FFFCH (* Operating system provides CS; RPL forced to 0 *)
(* Set rest of CS to a fixed value *)
CS.Base ← 0; (* Flat segment *)
IF ShadowStackEnabled(CPL)
IA32_PL3_SSP ← SSP; (* With shadow stacks enabled the system call is supported from Ring 3 to Ring 0 *)
(* OS supporting Ring 0 to Ring 0 system calls or Ring 1/2 to ring 0 system call *)
(* Must preserve the contents of IA32_PL3_SSP to avoid losing ring 3 state *)
FI;
CPL ← 0;
IF ShadowStackEnabled(CPL)
SSP ← 0;
FI;
IF EndbranchEnabled(CPL)
IA32_S_CET.TRACKER = WAIT_FOR_ENDBRANCH
IA32_S_CET.SUPPRESS = 0
FI;
Flags Affected
All.
Description
Executes a fast call to a level 0 system procedure or routine. SYSENTER is a companion instruction to SYSEXIT. The
instruction is optimized to provide the maximum performance for system calls from user code running at privilege
level 3 to operating system or executive procedures running at privilege level 0.
When executed in IA-32e mode, the SYSENTER instruction transitions the logical processor to 64-bit mode; other-
wise, the logical processor remains in protected mode.
Prior to executing the SYSENTER instruction, software must specify the privilege level 0 code segment and code
entry point, and the privilege level 0 stack segment and stack pointer by writing values to the following MSRs:
• IA32_SYSENTER_CS (MSR address 174H) — The lower 16 bits of this MSR are the segment selector for the
privilege level 0 code segment. This value is also used to determine the segment selector of the privilege level
0 stack segment (see the Operation section). This value cannot indicate a null selector.
• IA32_SYSENTER_EIP (MSR address 176H) — The value of this MSR is loaded into RIP (thus, this value
references the first instruction of the selected operating procedure or routine). In protected mode, only
bits 31:0 are loaded.
• IA32_SYSENTER_ESP (MSR address 175H) — The value of this MSR is loaded into RSP (thus, this value
contains the stack pointer for the privilege level 0 stack). This value cannot represent a non-canonical address.
In protected mode, only bits 31:0 are loaded.
These MSRs can be read from and written to using RDMSR/WRMSR. The WRMSR instruction ensures that the
IA32_SYSENTER_EIP and IA32_SYSENTER_ESP MSRs always contain canonical addresses.
While SYSENTER loads the CS and SS selectors with values derived from the IA32_SYSENTER_CS MSR, the CS and
SS descriptor caches are not loaded from the descriptors (in GDT or LDT) referenced by those selectors. Instead,
the descriptor caches are loaded with fixed values. See the Operation section for details. It is the responsibility of
OS software to ensure that the descriptors (in GDT or LDT) referenced by those selector values correspond to the
fixed values loaded into the descriptor caches; the SYSENTER instruction does not ensure this correspondence.
The SYSENTER instruction can be invoked from all operating modes except real-address mode.
The SYSENTER and SYSEXIT instructions are companion instructions, but they do not constitute a call/return pair.
When executing a SYSENTER instruction, the processor does not save state information for the user code (e.g., the
instruction pointer), and neither the SYSENTER nor the SYSEXIT instruction supports passing parameters on the
stack.
To use the SYSENTER and SYSEXIT instructions as companion instructions for transitions between privilege level 3
code and privilege level 0 operating system procedures, the following conventions must be followed:
• The segment descriptors for the privilege level 0 code and stack segments and for the privilege level 3 code and
stack segments must be contiguous in a descriptor table. This convention allows the processor to compute the
segment selectors from the value entered in the SYSENTER_CS_MSR MSR.
• The fast system call “stub” routines executed by user code (typically in shared libraries or DLLs) must save the
required return IP and processor state information if a return to the calling procedure is required. Likewise, the
operating system or executive procedures called with SYSENTER instructions must have access to and use this
saved return and state information when returning to the user code.
The SYSENTER and SYSEXIT instructions were introduced into the IA-32 architecture in the Pentium II processor.
The availability of these instructions on a processor is indicated with the SYSENTER/SYSEXIT present (SEP) feature
flag returned to the EDX register by the CPUID instruction. An operating system that qualifies the SEP flag must
also qualify the processor family and model to ensure that the SYSENTER/SYSEXIT instructions are actually
present.
Operation
IF CR0.PE = 0 OR IA32_SYSENTER_CS[15:2] = 0 THEN #GP(0); FI;
CS.G ← 1; (* 4-KByte granularity *)
IF ShadowStackEnabled(CPL)
IA32_PL3_SSP ← SSP;
FI;
CPL ← 0;
IF ShadowStackEnabled(CPL)
SSP ← 0;
FI;
IF EndbranchEnabled(CPL)
IA32_S_CET.TRACKER = WAIT_FOR_ENDBRANCH
IA32_S_CET.SUPPRESS = 0
FI;
Flags Affected
VM, IF (see Operation above)
Description
Executes a fast return to privilege level 3 user code. SYSEXIT is a companion instruction to the SYSENTER instruc-
tion. The instruction is optimized to provide the maximum performance for returns from system procedures
executing at protection level 0 to user procedures executing at protection level 3. It must be executed from code
executing at privilege level 0.
With a 64-bit operand size, SYSEXIT remains in 64-bit mode; otherwise, it either enters compatibility mode (if the
logical processor is in IA-32e mode) or remains in protected mode (if it is not).
Prior to executing SYSEXIT, software must specify the privilege level 3 code segment and code entry point, and the
privilege level 3 stack segment and stack pointer by writing values into the following MSR and general-purpose
registers:
• IA32_SYSENTER_CS (MSR address 174H) — Contains a 32-bit value that is used to determine the segment
selectors for the privilege level 3 code and stack segments (see the Operation section)
• RDX — The canonical address in this register is loaded into RIP (thus, this value references the first instruction
to be executed in the user code). If the return is not to 64-bit mode, only bits 31:0 are loaded.
• ECX — The canonical address in this register is loaded into RSP (thus, this value contains the stack pointer for
the privilege level 3 stack). If the return is not to 64-bit mode, only bits 31:0 are loaded.
The IA32_SYSENTER_CS MSR can be read from and written to using RDMSR and WRMSR.
While SYSEXIT loads the CS and SS selectors with values derived from the IA32_SYSENTER_CS MSR, the CS and
SS descriptor caches are not loaded from the descriptors (in GDT or LDT) referenced by those selectors. Instead,
the descriptor caches are loaded with fixed values. See the Operation section for details. It is the responsibility of
OS software to ensure that the descriptors (in GDT or LDT) referenced by those selector values correspond to the
fixed values loaded into the descriptor caches; the SYSEXIT instruction does not ensure this correspondence.
The SYSEXIT instruction can be invoked from all operating modes except real-address mode and virtual-8086
mode.
The SYSENTER and SYSEXIT instructions were introduced into the IA-32 architecture in the Pentium II processor.
The availability of these instructions on a processor is indicated with the SYSENTER/SYSEXIT present (SEP) feature
flag returned to the EDX register by the CPUID instruction. An operating system that qualifies the SEP flag must
also qualify the processor family and model to ensure that the SYSENTER/SYSEXIT instructions are actually
present.
When shadow stacks are enabled at privilege level 3 the instruction loads SSP with value from IA32_PL3_SSP MSR.
Refer to Chapter 6, “Procedure Calls, Interrupts, and Exceptions” and Chapter 18, “Control-Flow Enforcement
Technology (CET)” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 for additional
CET details.
Instruction ordering. Instructions following a SYSEXIT may be fetched from memory before earlier instructions
complete execution, but they will not execute (even speculatively) until all instructions prior to the SYSEXIT have
completed execution (the later instructions may execute before data stored by the earlier instructions have
become globally visible).
Operation
IF IA32_SYSENTER_CS[15:2] = 0 OR CR0.PE = 0 OR CPL ≠ 0 THEN #GP(0); FI;
IF ShadowStackEnabled(CPL)
SSP ← IA32_PL3_SSP;
FI;
SS.P ← 1;
SS.B ← 1; (* 32-bit stack segment *)
SS.G ← 1; (* 4-KByte granularity *)
Flags Affected
None.
Description
SYSRET is a companion instruction to the SYSCALL instruction. It returns from an OS system-call handler to user
code at privilege level 3. It does so by loading RIP from RCX and loading RFLAGS from R11.1 With a 64-bit operand
size, SYSRET remains in 64-bit mode; otherwise, it enters compatibility mode and only the low 32 bits of the regis-
ters are loaded.
SYSRET loads the CS and SS selectors with values derived from bits 63:48 of the IA32_STAR MSR. However, the
CS and SS descriptor caches are not loaded from the descriptors (in GDT or LDT) referenced by those selectors.
Instead, the descriptor caches are loaded with fixed values. See the Operation section for details. It is the respon-
sibility of OS software to ensure that the descriptors (in GDT or LDT) referenced by those selector values corre-
spond to the fixed values loaded into the descriptor caches; the SYSRET instruction does not ensure this
correspondence.
The SYSRET instruction does not modify the stack pointer (ESP or RSP). For that reason, it is necessary for soft-
ware to switch to the user stack. The OS may load the user stack pointer (if it was saved after SYSCALL) before
executing SYSRET; alternatively, user code may load the stack pointer (if it was saved before SYSCALL) after
receiving control from SYSRET.
If the OS loads the stack pointer before executing SYSRET, it must ensure that the handler of any interrupt or
exception delivered between restoring the stack pointer and successful execution of SYSRET is not invoked with the
user stack. It can do so using approaches such as the following:
• External interrupts. The OS can prevent an external interrupt from being delivered by clearing EFLAGS.IF
before loading the user stack pointer.
• Nonmaskable interrupts (NMIs). The OS can ensure that the NMI handler is invoked with the correct stack by
using the interrupt stack table (IST) mechanism for gate 2 (NMI) in the IDT (see Section 6.14.5, “Interrupt
Stack Table,” in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A).
• General-protection exceptions (#GP). The SYSRET instruction generates #GP(0) if the value of RCX is not
canonical. The OS can address this possibility using one or more of the following approaches:
— Confirming that the value of RCX is canonical before executing SYSRET.
— Using paging to ensure that the SYSCALL instruction will never save a non-canonical value into RCX.
— Using the IST mechanism for gate 13 (#GP) in the IDT.
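The canonicality condition on RCX can be illustrated as follows. This is a sketch assuming 48-bit linear addresses (4-level paging); the helper name is ours, and processors supporting 5-level paging use 57-bit addresses instead.

```python
def is_canonical(addr, va_bits=48):
    # An address is canonical if bits 63 down to va_bits-1 are all equal,
    # i.e., the upper bits are a sign-extension of bit va_bits-1.
    top = addr >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1
```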
When shadow stacks are enabled at privilege level 3 the instruction loads SSP with value from IA32_PL3_SSP MSR.
Refer to Chapter 6, “Procedure Calls, Interrupts, and Exceptions” and Chapter 18, “Control-Flow Enforcement
Technology (CET)” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 for additional
CET details.
1. Regardless of the value of R11, the RF and VM flags are always 0 in RFLAGS after execution of SYSRET. In addition, all reserved bits
in RFLAGS retain the fixed values.
Instruction ordering. Instructions following a SYSRET may be fetched from memory before earlier instructions
complete execution, but they will not execute (even speculatively) until all instructions prior to the SYSRET have
completed execution (the later instructions may execute before data stored by the earlier instructions have become
globally visible).
Operation
IF (CS.L ≠ 1 ) or (IA32_EFER.LMA ≠ 1) or (IA32_EFER.SCE ≠ 1)
(* Not in 64-Bit Mode or SYSCALL/SYSRET not enabled in IA32_EFER *)
THEN #UD; FI;
IF (CPL ≠ 0) THEN #GP(0); FI;
Flags Affected
All.
------------------------------------------------------------------------------------------
Changes to this chapter:
Description
Compresses (stores) up to 64 byte values or 32 word values from the source operand (second operand) to the destination operand (first operand), based on the active elements determined by the writemask operand. Note:
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Moves up to 512 bits of packed byte values from the source operand (second operand) to the destination operand
(first operand). This instruction is used to store partial contents of a vector register into a byte vector or single
memory location using the active elements in operand writemask.
Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z
must be zero.
Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the
source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper
bits are zeroed.
This instruction supports memory fault suppression.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.
Operation
VPCOMPRESSB store form
(KL, VL) = (16, 128), (32, 256), (64, 512)
k ← 0
FOR j ← 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.byte[k] ← SRC.byte[j]
k ← k + 1
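The store form above can be modeled minimally in Python (the function name is ours):

```python
def vpcompressb_store(mask_bits, src_bytes):
    # Gather the source bytes whose mask bit is 1 into a contiguous
    # vector, preserving source order; only this contiguous vector is
    # written to the memory destination.
    dest = []
    for j, b in enumerate(src_bytes):
        if (mask_bits >> j) & 1:
            dest.append(b)
    return dest
```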
Other Exceptions
See Exceptions Type E4.
Description
Multiplies the individual unsigned bytes of the first source operand by the corresponding signed bytes of the second
source operand, producing intermediate signed word results. The word results are then summed and accumulated
in the destination dword element size operand.
This instruction supports memory fault suppression.
Operation
VPDPBUSD dest, src1, src2
(KL,VL)=(4,128), (8,256), (16,512)
ORIGDEST ← DEST
FOR i ← 0 TO KL-1:
IF k1[i] or *no writemask*:
// Byte elements of SRC1 are zero-extended to 16b and
// byte elements of SRC2 are sign extended to 16b before multiplication.
IF SRC2 is memory and EVEX.b == 1:
t ← SRC2.dword[0]
ELSE:
t ← SRC2.dword[i]
p1word ← ZERO_EXTEND(SRC1.byte[4*i]) * SIGN_EXTEND(t.byte[0])
p2word ← ZERO_EXTEND(SRC1.byte[4*i+1]) * SIGN_EXTEND(t.byte[1])
p3word ← ZERO_EXTEND(SRC1.byte[4*i+2]) * SIGN_EXTEND(t.byte[2])
p4word ← ZERO_EXTEND(SRC1.byte[4*i+3]) * SIGN_EXTEND(t.byte[3])
DEST.dword[i] ← ORIGDEST.dword[i] + p1word + p2word + p3word + p4word
ELSE IF *zeroing*:
DEST.dword[i] ← 0
ELSE: // Merge masking, dest element unchanged
DEST.dword[i] ← ORIGDEST.dword[i]
DEST[MAX_VL-1:VL] ← 0
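One destination dword lane can be sketched as below (names are ours). Note the asymmetry the pseudocode expresses: SRC1 bytes are zero-extended (unsigned), SRC2 bytes are sign-extended (signed), and the accumulation wraps without saturation.

```python
def vpdpbusd_lane(acc, src1_bytes, src2_bytes):
    # acc is the 32-bit destination dword; src1_bytes/src2_bytes are the
    # four bytes of the corresponding SRC1/SRC2 dword lane.
    total = acc
    for u, s in zip(src1_bytes, src2_bytes):
        signed = s - 256 if s >= 128 else s  # sign-extend the SRC2 byte
        total += u * signed                  # SRC1 byte is zero-extended
    return total & 0xFFFFFFFF                # wrap, no saturation
```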
Other Exceptions
See Exceptions Type E4.
VPDPBUSDS — Multiply and Add Unsigned and Signed Bytes with Saturation
Opcode/Instruction: EVEX.128.66.0F38.W0 51 /r VPDPBUSDS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512_VNNI AVX512VL.
Description: Multiply groups of 4 pairs of signed bytes in xmm3/m128/m32bcst with corresponding unsigned bytes of xmm2, summing those products and adding them to the doubleword result, with signed saturation, in xmm1, under writemask k1.
Opcode/Instruction: EVEX.256.66.0F38.W0 51 /r VPDPBUSDS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512_VNNI AVX512VL.
Description: Multiply groups of 4 pairs of signed bytes in ymm3/m256/m32bcst with corresponding unsigned bytes of ymm2, summing those products and adding them to the doubleword result, with signed saturation, in ymm1, under writemask k1.
Opcode/Instruction: EVEX.512.66.0F38.W0 51 /r VPDPBUSDS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512_VNNI.
Description: Multiply groups of 4 pairs of signed bytes in zmm3/m512/m32bcst with corresponding unsigned bytes of zmm2, summing those products and adding them to the doubleword result, with signed saturation, in zmm1, under writemask k1.
Description
Multiplies the individual unsigned bytes of the first source operand by the corresponding signed bytes of the second
source operand, producing intermediate signed word results. The word results are then summed and accumulated
in the destination dword element size operand. If the intermediate sum overflows a 32b signed number, the result
is saturated to either 0x7FFF_FFFF for positive numbers or 0x8000_0000 for negative numbers.
This instruction supports memory fault suppression.
Operation
VPDPBUSDS dest, src1, src2
(KL,VL)=(4,128), (8,256), (16,512)
ORIGDEST ← DEST
FOR i ← 0 TO KL-1:
IF k1[i] or *no writemask*:
// Byte elements of SRC1 are zero-extended to 16b and
// byte elements of SRC2 are sign extended to 16b before multiplication.
IF SRC2 is memory and EVEX.b == 1:
t ← SRC2.dword[0]
ELSE:
t ← SRC2.dword[i]
p1word ← ZERO_EXTEND(SRC1.byte[4*i]) * SIGN_EXTEND(t.byte[0])
p2word ← ZERO_EXTEND(SRC1.byte[4*i+1]) * SIGN_EXTEND(t.byte[1])
p3word ← ZERO_EXTEND(SRC1.byte[4*i+2]) * SIGN_EXTEND(t.byte[2])
p4word ← ZERO_EXTEND(SRC1.byte[4*i+3]) *SIGN_EXTEND(t.byte[3])
DEST.dword[i] ← SIGNED_DWORD_SATURATE(ORIGDEST.dword[i] + p1word + p2word + p3word + p4word)
ELSE IF *zeroing*:
DEST.dword[i] ← 0
ELSE: // Merge masking, dest element unchanged
DEST.dword[i] ← ORIGDEST.dword[i]
DEST[MAX_VL-1:VL] ← 0
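The saturating variant differs from VPDPBUSD only in the final write-back; a per-lane sketch (names ours):

```python
def vpdpbusds_lane(acc, src1_bytes, src2_bytes):
    # Interpret acc as a signed 32-bit value, accumulate the four
    # unsigned-by-signed byte products, then apply signed saturation.
    total = acc - (1 << 32) if acc >= (1 << 31) else acc
    for u, s in zip(src1_bytes, src2_bytes):
        signed = s - 256 if s >= 128 else s  # sign-extend the SRC2 byte
        total += u * signed                  # SRC1 byte is zero-extended
    total = max(-(1 << 31), min((1 << 31) - 1, total))  # SIGNED_DWORD_SATURATE
    return total & 0xFFFFFFFF
```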
Other Exceptions
See Exceptions Type E4.
Description
Multiplies the individual signed words of the first source operand by the corresponding signed words of the second
source operand, producing intermediate signed, doubleword results. The adjacent doubleword results are then
summed and accumulated in the destination operand.
This instruction supports memory fault suppression.
Operation
VPDPWSSD dest, src1, src2
(KL,VL)=(4,128), (8,256), (16,512)
ORIGDEST ← DEST
FOR i ← 0 TO KL-1:
IF k1[i] or *no writemask*:
IF SRC2 is memory and EVEX.b == 1:
t ← SRC2.dword[0]
ELSE:
t ← SRC2.dword[i]
p1dword ← SRC1.word[2*i] * t.word[0]
p2dword ← SRC1.word[2*i+1] * t.word[1]
DEST.dword[i] ← ORIGDEST.dword[i] + p1dword + p2dword
ELSE IF *zeroing*:
DEST.dword[i] ← 0
ELSE: // Merge masking, dest element unchanged
DEST.dword[i] ← ORIGDEST.dword[i]
DEST[MAX_VL-1:VL] ← 0
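A per-lane sketch of the word dot product above (names ours):

```python
def vpdpwssd_lane(acc, src1_words, src2_words):
    # Two signed 16-bit by signed 16-bit products, summed into the
    # 32-bit accumulator; the sum wraps, no saturation.
    total = acc + src1_words[0] * src2_words[0] + src1_words[1] * src2_words[1]
    return total & 0xFFFFFFFF
```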
Other Exceptions
See Exceptions Type E4.
Description
Multiplies the individual signed words of the first source operand by the corresponding signed words of the second
source operand, producing intermediate signed, doubleword results. The adjacent doubleword results are then
summed and accumulated in the destination operand. If the intermediate sum overflows a 32b signed number, the
result is saturated to either 0x7FFF_FFFF for positive numbers or 0x8000_0000 for negative numbers.
This instruction supports memory fault suppression.
Operation
VPDPWSSDS dest, src1, src2
(KL,VL)=(4,128), (8,256), (16,512)
ORIGDEST ← DEST
FOR i ← 0 TO KL-1:
IF k1[i] or *no writemask*:
IF SRC2 is memory and EVEX.b == 1:
t ← SRC2.dword[0]
ELSE:
t ← SRC2.dword[i]
p1dword ← SRC1.word[2*i] * t.word[0]
p2dword ← SRC1.word[2*i+1] * t.word[1]
DEST.dword[i] ← SIGNED_DWORD_SATURATE(ORIGDEST.dword[i] + p1dword + p2dword)
ELSE IF *zeroing*:
DEST.dword[i] ← 0
ELSE: // Merge masking, dest element unchanged
DEST.dword[i] ← ORIGDEST.dword[i]
DEST[MAX_VL-1:VL] ← 0
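As with the byte form, only the write-back changes; a per-lane sketch (names ours):

```python
def vpdpwssds_lane(acc, src1_words, src2_words):
    # As VPDPWSSD, but with signed-dword saturation on write-back.
    total = acc - (1 << 32) if acc >= (1 << 31) else acc
    total += src1_words[0] * src2_words[0] + src1_words[1] * src2_words[1]
    total = max(-(1 << 31), min((1 << 31) - 1, total))  # SIGNED_DWORD_SATURATE
    return total & 0xFFFFFFFF
```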
Other Exceptions
See Exceptions Type E4.
Description
Expands (loads) up to 64 byte integer values or 32 word integer values from the source operand (memory
operand) to the destination operand (register operand), based on the active elements determined by the
writemask operand.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Moves 128, 256 or 512 bits of packed byte integer values from the source operand (memory operand) to the desti-
nation operand (register operand). This instruction is used to load from an int8 vector register or memory location
while inserting the data into sparse elements of destination vector register using the active elements pointed out
by the operand writemask.
This instruction supports memory fault suppression.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.
Operation
VPEXPANDB
(KL, VL) = (16, 128), (32, 256), (64, 512)
k ← 0
FOR j ← 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.byte[j] ← SRC.byte[k];
k ← k + 1
ELSE:
IF *merging-masking*:
*DEST.byte[j] remains unchanged*
ELSE: ; zeroing-masking
DEST.byte[j] ← 0
DEST[MAX_VL-1:VL] ← 0
VPEXPANDW
(KL, VL) = (8,128), (16,256), (32, 512)
k ← 0
FOR j ← 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.word[j] ← SRC.word[k];
k ← k + 1
ELSE:
IF *merging-masking*:
*DEST.word[j] remains unchanged*
ELSE: ; zeroing-masking
DEST.word[j] ← 0
DEST[MAX_VL-1:VL] ← 0
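The expand operation (the inverse of the compress above) can be sketched element-size-agnostically in Python; the function name and parameters are ours:

```python
def vpexpand(mask_bits, src_elems, kl, zeroing=True, old_dest=None):
    # Pull consecutive source elements into the destination positions
    # whose mask bit is 1; other positions are zeroed (zeroing-masking)
    # or keep their old value (merging-masking).
    old = old_dest if old_dest is not None else [0] * kl
    dest, k = [], 0
    for j in range(kl):
        if (mask_bits >> j) & 1:
            dest.append(src_elems[k])
            k += 1
        else:
            dest.append(0 if zeroing else old[j])
    return dest
```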
Other Exceptions
See Exceptions Type E4.
Description
This instruction counts the number of bits set to one in each byte, word, dword or qword element of its source (e.g.,
zmm2 or memory) and places the results in the destination register (zmm1). This instruction supports memory
fault suppression.
Operation
VPOPCNTB
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1:
IF MaskBit(j) OR *no writemask*:
DEST.byte[j] ← POPCNT(SRC.byte[j])
ELSE IF *merging-masking*:
*DEST.byte[j] remains unchanged*
ELSE:
DEST.byte[j] ← 0
DEST[MAX_VL-1:VL] ← 0
VPOPCNTW
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1:
    IF MaskBit(j) OR *no writemask*:
        DEST.word[j] ← POPCNT(SRC.word[j])
    ELSE IF *merging-masking*:
        *DEST.word[j] remains unchanged*
    ELSE:
        DEST.word[j] ← 0
DEST[MAX_VL-1:VL] ← 0
VPOPCNTD
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1:
    IF MaskBit(j) OR *no writemask*:
        IF SRC is broadcast memop:
            t ← SRC.dword[0]
        ELSE:
            t ← SRC.dword[j]
        DEST.dword[j] ← POPCNT(t)
    ELSE IF *merging-masking*:
        *DEST.dword[j] remains unchanged*
    ELSE:
        DEST.dword[j] ← 0
DEST[MAX_VL-1:VL] ← 0
VPOPCNTQ
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1:
    IF MaskBit(j) OR *no writemask*:
        IF SRC is broadcast memop:
            t ← SRC.qword[0]
        ELSE:
            t ← SRC.qword[j]
        DEST.qword[j] ← POPCNT(t)
    ELSE IF *merging-masking*:
        *DEST.qword[j] remains unchanged*
    ELSE:
        DEST.qword[j] ← 0
DEST[MAX_VL-1:VL] ← 0
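The per-element flow can be modeled in scalar Python (illustrative sketch; the helper name is ours, and element lists stand in for vector registers):

```python
def vpopcnt(src, mask, zeroing=False, dest=None):
    """Per-element population count under a writemask, as in VPOPCNTB/W/D/Q."""
    dest = [0] * len(src) if dest is None else list(dest)
    out = []
    for j, (val, bit) in enumerate(zip(src, mask)):
        if bit:
            out.append(bin(val).count("1"))  # POPCNT of element j
        elif zeroing:
            out.append(0)
        else:
            out.append(dest[j])  # merging-masking keeps the old lane
    return out

print(vpopcnt([0b1011, 0xFF, 0, 7], [1, 1, 1, 1]))  # [3, 8, 0, 3]
```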
Other Exceptions
See Exceptions Type E4.
VPSHLD — Concatenate and Shift Packed Data Left Logical
Description
Concatenates packed data, then extracts the result shifted to the left by a constant value.
This instruction supports memory fault suppression.
Operation
VPSHLDW DEST, SRC2, SRC3, imm8
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1:
    IF MaskBit(j) OR *no writemask*:
        tmp ← concat(SRC2.word[j], SRC3.word[j]) << (imm8 & 15)
        DEST.word[j] ← tmp.word[1]
    ELSE IF *zeroing*:
        DEST.word[j] ← 0
    *ELSE DEST.word[j] remains unchanged*
DEST[MAX_VL-1:VL] ← 0
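One word lane of VPSHLDW can be modeled as follows (illustrative Python; masking is omitted for brevity and the function name is ours):

```python
def vpshldw_lane(a, b, imm8):
    """One lane of VPSHLDW: concatenate a (high word) and b (low word)
    into a 32-bit value, shift left by imm8 & 15, keep the upper 16 bits."""
    tmp = ((a << 16) | b) << (imm8 & 15)
    return (tmp >> 16) & 0xFFFF

# Shifting by 4 pulls the top 4 bits of b into the low end of the result.
print(hex(vpshldw_lane(0x1234, 0xABCD, 4)))  # 0x234a
```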
Other Exceptions
See Type E4.
VPSHLDV — Concatenate and Variable Shift Packed Data Left Logical
Description
Concatenates packed data, then extracts the result shifted to the left by a variable value.
This instruction supports memory fault suppression.
Operation
FUNCTION concat(a, b):
    IF words:
        d.word[1] ← a
        d.word[0] ← b
        return d
    ELSE IF dwords:
        q.dword[1] ← a
        q.dword[0] ← b
        return q
    ELSE IF qwords:
        o.qword[1] ← a
        o.qword[0] ← b
        return o
Other Exceptions
See Type E4.
VPSHRD — Concatenate and Shift Packed Data Right Logical
Description
Concatenates packed data, then extracts the result shifted to the right by a constant value.
This instruction supports memory fault suppression.
Operation
VPSHRDW DEST, SRC2, SRC3, imm8
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1:
    IF MaskBit(j) OR *no writemask*:
        DEST.word[j] ← concat(SRC3.word[j], SRC2.word[j]) >> (imm8 & 15)
    ELSE IF *zeroing*:
        DEST.word[j] ← 0
    *ELSE DEST.word[j] remains unchanged*
DEST[MAX_VL-1:VL] ← 0
Other Exceptions
See Type E4.
VPSHRDV — Concatenate and Variable Shift Packed Data Right Logical
Description
Concatenates packed data, then extracts the result shifted to the right by a variable value.
This instruction supports memory fault suppression.
Operation
VPSHRDVW DEST, SRC2, SRC3
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1:
    IF MaskBit(j) OR *no writemask*:
        DEST.word[j] ← concat(SRC2.word[j], DEST.word[j]) >> (SRC3.word[j] & 15)
    ELSE IF *zeroing*:
        DEST.word[j] ← 0
    *ELSE DEST.word[j] remains unchanged*
DEST[MAX_VL-1:VL] ← 0
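One word lane of VPSHRDVW, with its per-lane variable count, can be modeled as (illustrative Python; masking omitted and the function name is ours):

```python
def vpshrdvw_lane(dest, src2, cnt):
    """One lane of VPSHRDVW: src2 supplies the high word and the old
    destination value the low word; shift right by a per-lane count."""
    tmp = ((src2 << 16) | dest) >> (cnt & 15)
    return tmp & 0xFFFF

# Shifting right by 4 pulls the low 4 bits of src2 into the top of the result.
print(hex(vpshrdvw_lane(0xABCD, 0x1234, 4)))  # 0x4abc
```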
Other Exceptions
See Type E4.
VPSHUFBITQMB — Shuffle Bits from Quadword Elements Using Byte Indexes into Mask
EVEX.128.66.0F38.W0 8F /r VPSHUFBITQMB k1{k2}, xmm2, xmm3/m128
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512_BITALG AVX512VL.
    Extract values in xmm2 using control bits of xmm3/m128 with writemask k2 and leave the result in mask register k1.
EVEX.256.66.0F38.W0 8F /r VPSHUFBITQMB k1{k2}, ymm2, ymm3/m256
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512_BITALG AVX512VL.
    Extract values in ymm2 using control bits of ymm3/m256 with writemask k2 and leave the result in mask register k1.
EVEX.512.66.0F38.W0 8F /r VPSHUFBITQMB k1{k2}, zmm2, zmm3/m512
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512_BITALG.
    Extract values in zmm2 using control bits of zmm3/m512 with writemask k2 and leave the result in mask register k1.
Description
The VPSHUFBITQMB instruction performs a bit-gather select, using the second source operand as control and the first source operand as data. Each destination bit uses 6 control bits (from the second source operand) to select which data bit (from the first source operand) is gathered. A given destination bit can only access 64 bits of data: the first 64 destination bits can access the first 64 data bits, the second 64 destination bits can access the second 64 data bits, and so on. Control data for each output bit is stored in 8-bit elements of SRC2, but only the 6 least significant bits of each element are used.
This instruction uses write masking (zeroing only). This instruction supports memory fault suppression.
The first source operand is a ZMM register. The second source operand is a ZMM register or a memory location. The
destination operand is a mask register.
Operation
VPSHUFBITQMB DEST, SRC1, SRC2
(KL, VL) = (16,128), (32,256), (64, 512)
FOR i ← 0 TO KL/8-1: // qword
    FOR j ← 0 TO 7: // byte
        IF k2[i*8+j] OR *no writemask*:
            m ← SRC2.qword[i].byte[j] & 0x3F
            k1[i*8+j] ← SRC1.qword[i].bit[m]
        ELSE:
            k1[i*8+j] ← 0
k1[MAX_KL-1:KL] ← 0
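The bit gather above can be modeled in Python, with qwords as integers (illustrative only; the function name is ours):

```python
def vpshufbitqmb(src1_qwords, src2_qwords, k2=None):
    """Bit gather: each control byte in SRC2 selects, via its low 6 bits,
    one bit of the matching SRC1 qword; the results form a bitmask."""
    k1 = []
    for i, (data, ctrl) in enumerate(zip(src1_qwords, src2_qwords)):
        for j in range(8):
            if k2 is None or k2[i * 8 + j]:
                m = (ctrl >> (8 * j)) & 0x3F  # 6-bit index into the data qword
                k1.append((data >> m) & 1)
            else:
                k1.append(0)  # zeroing-only masking
    return k1

# Control bytes 0..7 simply gather bits 0..7 of the data qword.
ctrl = int.from_bytes(bytes(range(8)), "little")
print(vpshufbitqmb([0b10110001], [ctrl]))  # [1, 0, 0, 0, 1, 1, 0, 1]
```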
Description
Perform reduction transformation of the packed binary encoded double-precision FP values in the source operand
(the second operand) and store the reduced results in binary FP format to the destination operand (the first
operand) under the writemask k1.
The reduction transformation subtracts the integer part and the leading M fractional bits from the binary FP source value, where M is an unsigned integer specified by imm8[7:4]; see Figure 5-28. Specifically, the reduction transformation can be expressed as:
dest = src - (ROUND(2^M * src)) * 2^(-M);
where ROUND() treats src, 2^M, and their product as binary FP numbers with normalized significands and biased exponents.
The magnitude of the reduced result can be expressed by considering src = 2^p * man2, where man2 is the normalized significand and p is the unbiased exponent.
Then if RC = RNE: 0 ≤ |Reduced Result| ≤ 2^(p-M-1)
Then if RC ≠ RNE: 0 ≤ |Reduced Result| < 2^(p-M)
This instruction might set the precision exception flag. However, if SPE is set (Suppress Precision Exception, imm8[3] = 1), no precision exception is reported.
EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.
Figure 5-28 shows the imm8 control byte: imm8[7:4] = M, imm8[3] = SPE (suppress precision exception), imm8[2] = RC source, imm8[1:0] = RC.
Operation
ReduceArgumentDP(SRC[63:0], imm8[7:0])
{
    // Check for NaN
    IF (SRC[63:0] = NAN) THEN
        RETURN (Convert SRC[63:0] to QNaN); FI;
    M ← imm8[7:4]; // Number of fraction bits of the normalized significand to be subtracted
    RC ← imm8[1:0]; // Round Control for ROUND() operation
    RC_source ← imm8[2];
    SPE ← imm8[3]; // Suppress Precision Exception
    TMP[63:0] ← 2^(-M) * {ROUND(2^M * SRC[63:0], SPE, RC_source, RC)}; // ROUND() treats SRC and 2^M as standard binary FP values
    TMP[63:0] ← SRC[63:0] - TMP[63:0]; // subtraction under the same RC, SPE controls
    RETURN TMP[63:0]; // binary encoded FP with biased exponent and normalized significand
}
VREDUCEPD
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b == 1) AND (SRC *is memory*)
            THEN DEST[i+63:i] ← ReduceArgumentDP(SRC[63:0], imm8[7:0]);
            ELSE DEST[i+63:i] ← ReduceArgumentDP(SRC[i+63:i], imm8[7:0]);
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+63:i] ← 0
        FI;
    FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0
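For RC = RNE, the reduction can be sketched with Python floats (illustrative only; it ignores rounding of the final subtraction and all exception reporting):

```python
def reduce_dp(src, M):
    """Sketch of ReduceArgumentDP with round-to-nearest-even:
    subtract the integer part plus the leading M fraction bits."""
    scaled = src * (2.0 ** M)
    # Python's round() is round-half-to-even, matching RNE.
    return src - round(scaled) * (2.0 ** -M)

print(reduce_dp(1.75, 0))  # M = 0 strips the (rounded) integer part: 1.75 - 2.0
print(reduce_dp(1.25, 1))  # 2.5 rounds to the even 2; result is 1.25 - 1.0
```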
SIMD Floating-Point Exceptions
Invalid, Precision
If SPE is set, the precision exception is not reported (regardless of the MXCSR exception mask).
Other Exceptions
See Exceptions Type E2; additionally:
#UD If EVEX.vvvv != 1111B.
Description
Perform reduction transformation of the packed binary encoded single-precision FP values in the source operand
(the second operand) and store the reduced results in binary FP format to the destination operand (the first
operand) under the writemask k1.
The reduction transformation subtracts the integer part and the leading M fractional bits from the binary FP source value, where M is an unsigned integer specified by imm8[7:4]; see Figure 5-28. Specifically, the reduction transformation can be expressed as:
dest = src - (ROUND(2^M * src)) * 2^(-M);
where ROUND() treats src, 2^M, and their product as binary FP numbers with normalized significands and biased exponents.
The magnitude of the reduced result can be expressed by considering src = 2^p * man2, where man2 is the normalized significand and p is the unbiased exponent.
Then if RC = RNE: 0 ≤ |Reduced Result| ≤ 2^(p-M-1)
Then if RC ≠ RNE: 0 ≤ |Reduced Result| < 2^(p-M)
This instruction might set the precision exception flag. However, if SPE is set (Suppress Precision Exception, imm8[3] = 1), no precision exception is reported.
EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.
Handling of special input values is listed in Table 5-15.
Operation
ReduceArgumentSP(SRC[31:0], imm8[7:0])
{
    // Check for NaN
    IF (SRC[31:0] = NAN) THEN
        RETURN (Convert SRC[31:0] to QNaN); FI;
    M ← imm8[7:4]; // Number of fraction bits of the normalized significand to be subtracted
    RC ← imm8[1:0]; // Round Control for ROUND() operation
    RC_source ← imm8[2];
    SPE ← imm8[3]; // Suppress Precision Exception
    TMP[31:0] ← 2^(-M) * {ROUND(2^M * SRC[31:0], SPE, RC_source, RC)}; // ROUND() treats SRC and 2^M as standard binary FP values
    TMP[31:0] ← SRC[31:0] - TMP[31:0]; // subtraction under the same RC, SPE controls
    RETURN TMP[31:0]; // binary encoded FP with biased exponent and normalized significand
}
VREDUCEPS
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b == 1) AND (SRC *is memory*)
            THEN DEST[i+31:i] ← ReduceArgumentSP(SRC[31:0], imm8[7:0]);
            ELSE DEST[i+31:i] ← ReduceArgumentSP(SRC[i+31:i], imm8[7:0]);
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+31:i] ← 0
        FI;
    FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0
SIMD Floating-Point Exceptions
Invalid, Precision
If SPE is set, the precision exception is not reported (regardless of the MXCSR exception mask).
Other Exceptions
See Exceptions Type E2; additionally:
#UD If EVEX.vvvv != 1111B.
Description
Perform a reduction transformation of the binary encoded double-precision FP value in the low qword element of
the second source operand (the third operand) and store the reduced result in binary FP format to the low qword
element of the destination operand (the first operand) under the writemask k1. Bits 127:64 of the destination
operand are copied from respective qword elements of the first source operand (the second operand).
The reduction transformation subtracts the integer part and the leading M fractional bits from the binary FP source value, where M is an unsigned integer specified by imm8[7:4]; see Figure 5-28. Specifically, the reduction transformation can be expressed as:
dest = src - (ROUND(2^M * src)) * 2^(-M);
where ROUND() treats src, 2^M, and their product as binary FP numbers with normalized significands and biased exponents.
The magnitude of the reduced result can be expressed by considering src = 2^p * man2, where man2 is the normalized significand and p is the unbiased exponent.
Then if RC = RNE: 0 ≤ |Reduced Result| ≤ 2^(p-M-1)
Then if RC ≠ RNE: 0 ≤ |Reduced Result| < 2^(p-M)
This instruction might set the precision exception flag. However, if SPE is set (Suppress Precision Exception, imm8[3] = 1), no precision exception is reported.
The operation is write masked.
Handling of special input values is listed in Table 5-15.
Operation
ReduceArgumentDP(SRC[63:0], imm8[7:0])
{
    // Check for NaN
    IF (SRC[63:0] = NAN) THEN
        RETURN (Convert SRC[63:0] to QNaN); FI;
    M ← imm8[7:4]; // Number of fraction bits of the normalized significand to be subtracted
    RC ← imm8[1:0]; // Round Control for ROUND() operation
    RC_source ← imm8[2];
    SPE ← imm8[3]; // Suppress Precision Exception
    TMP[63:0] ← 2^(-M) * {ROUND(2^M * SRC[63:0], SPE, RC_source, RC)}; // ROUND() treats SRC and 2^M as standard binary FP values
    TMP[63:0] ← SRC[63:0] - TMP[63:0]; // subtraction under the same RC, SPE controls
    RETURN TMP[63:0]; // binary encoded FP with biased exponent and normalized significand
}
VREDUCESD
IF k1[0] OR *no writemask*
    THEN DEST[63:0] ← ReduceArgumentDP(SRC2[63:0], imm8[7:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0
SIMD Floating-Point Exceptions
Invalid, Precision
If SPE is set, the precision exception is not reported (regardless of the MXCSR exception mask).
Other Exceptions
See Exceptions Type E3.
Description
Perform a reduction transformation of the binary encoded single-precision FP value in the low dword element of the
second source operand (the third operand) and store the reduced result in binary FP format to the low dword
element of the destination operand (the first operand) under the writemask k1. Bits 127:32 of the destination
operand are copied from respective dword elements of the first source operand (the second operand).
The reduction transformation subtracts the integer part and the leading M fractional bits from the binary FP source value, where M is an unsigned integer specified by imm8[7:4]; see Figure 5-28. Specifically, the reduction transformation can be expressed as:
dest = src - (ROUND(2^M * src)) * 2^(-M);
where ROUND() treats src, 2^M, and their product as binary FP numbers with normalized significands and biased exponents.
The magnitude of the reduced result can be expressed by considering src = 2^p * man2, where man2 is the normalized significand and p is the unbiased exponent.
Then if RC = RNE: 0 ≤ |Reduced Result| ≤ 2^(p-M-1)
Then if RC ≠ RNE: 0 ≤ |Reduced Result| < 2^(p-M)
This instruction might set the precision exception flag. However, if SPE is set (Suppress Precision Exception, imm8[3] = 1), no precision exception is reported.
Handling of special input values is listed in Table 5-15.
Operation
ReduceArgumentSP(SRC[31:0], imm8[7:0])
{
    // Check for NaN
    IF (SRC[31:0] = NAN) THEN
        RETURN (Convert SRC[31:0] to QNaN); FI;
    M ← imm8[7:4]; // Number of fraction bits of the normalized significand to be subtracted
    RC ← imm8[1:0]; // Round Control for ROUND() operation
    RC_source ← imm8[2];
    SPE ← imm8[3]; // Suppress Precision Exception
    TMP[31:0] ← 2^(-M) * {ROUND(2^M * SRC[31:0], SPE, RC_source, RC)}; // ROUND() treats SRC and 2^M as standard binary FP values
    TMP[31:0] ← SRC[31:0] - TMP[31:0]; // subtraction under the same RC, SPE controls
    RETURN TMP[31:0]; // binary encoded FP with biased exponent and normalized significand
}
VREDUCESS
IF k1[0] OR *no writemask*
    THEN DEST[31:0] ← ReduceArgumentSP(SRC2[31:0], imm8[7:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0
SIMD Floating-Point Exceptions
Invalid, Precision
If SPE is set, the precision exception is not reported (regardless of the MXCSR exception mask).
Other Exceptions
See Exceptions Type E3.
Description
In 64-bit mode, the instruction zeroes XMM0-XMM15, YMM0-YMM15, and ZMM0-ZMM15. Outside 64-bit mode, it
zeroes only XMM0-XMM7, YMM0-YMM7, and ZMM0-ZMM7. VZEROALL does not modify ZMM16-ZMM31.
Note: VEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD. In compatibility and legacy 32-bit modes, only the lower eight registers are modified.
Operation
simd_reg_file[][] is a two-dimensional array representing the SIMD register file, containing all the overlapping xmm, ymm and zmm registers present in that implementation. The major dimension is the register number: 0 for xmm0, ymm0 and zmm0; 1 for xmm1, ymm1, and zmm1; etc. The minor dimension size is the width of the implemented SIMD state measured in bits. On a machine supporting Intel AVX-512, the width is 512.
VZEROALL
IF (64-bit mode)
    limit ← 15
ELSE
    limit ← 7
FOR i in 0 .. limit:
    simd_reg_file[i][MAXVL-1:0] ← 0
Other Exceptions
See Exceptions Type 8.
Description
In 64-bit mode, the instruction zeroes the bits in positions 128 and higher in YMM0-YMM15 and ZMM0-ZMM15.
Outside 64-bit mode, it zeroes those bits only in YMM0-YMM7 and ZMM0-ZMM7. VZEROUPPER does not modify the
lower 128 bits of these registers and it does not modify ZMM16-ZMM31.
This instruction is recommended when transitioning between AVX and legacy SSE code; it will eliminate perfor-
mance penalties caused by false dependencies.
Note: VEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD. In compatibility and legacy 32-bit modes, only the lower eight registers are modified.
Operation
simd_reg_file[][] is a two-dimensional array representing the SIMD register file, containing all the overlapping xmm, ymm and zmm registers present in that implementation. The major dimension is the register number: 0 for xmm0, ymm0 and zmm0; 1 for xmm1, ymm1, and zmm1; etc. The minor dimension size is the width of the implemented SIMD state measured in bits.
VZEROUPPER
IF (64-bit mode)
    limit ← 15
ELSE
    limit ← 7
FOR i in 0 .. limit:
    simd_reg_file[i][MAXVL-1:128] ← 0
Other Exceptions
See Exceptions Type 8.
Description
Writes bytes in the source register to the shadow stack.
Operation
IF CPL = 3
    IF (CR4.CET & IA32_U_CET.SH_STK_EN) = 0
        THEN #UD; FI;
    IF (IA32_U_CET.WR_SHSTK_EN) = 0
        THEN #UD; FI;
ELSE
    IF (CR4.CET & IA32_S_CET.SH_STK_EN) = 0
        THEN #UD; FI;
    IF (IA32_S_CET.WR_SHSTK_EN) = 0
        THEN #UD; FI;
FI;
DEST_LA = Linear_Address(mem operand)
IF (operand size is 64 bit)
    THEN
        (* Destination not 8B aligned *)
        IF DEST_LA[2:0]
            THEN #GP(0); FI;
        Shadow_stack_store 8 bytes of SRC to DEST_LA;
    ELSE
        (* Destination not 4B aligned *)
        IF DEST_LA[1:0]
            THEN #GP(0); FI;
        Shadow_stack_store 4 bytes of SRC[31:0] to DEST_LA;
FI;
Flags Affected
None.
Description
Writes bytes in the source register to a user shadow stack page. The WRUSS instruction can be executed only if CPL = 0; however, the processor treats its shadow-stack accesses as user accesses.
Operation
IF CR4.CET = 0
    THEN #UD; FI;
IF CPL > 0
    THEN #GP(0); FI;
DEST_LA = Linear_Address(mem operand)
IF (operand size is 64 bit)
    THEN
        (* Destination not 8B aligned *)
        IF DEST_LA[2:0]
            THEN #GP(0); FI;
        Shadow_stack_store 8 bytes of SRC to DEST_LA as user-mode access;
    ELSE
        (* Destination not 4B aligned *)
        IF DEST_LA[1:0]
            THEN #GP(0); FI;
        Shadow_stack_store 4 bytes of SRC[31:0] to DEST_LA as user-mode access;
FI;
Flags Affected
None.
Description
The XBEGIN instruction specifies the start of an RTM code region. If the logical processor was not already in trans-
actional execution, then the XBEGIN instruction causes the logical processor to transition into transactional execu-
tion. The XBEGIN instruction that transitions the logical processor into transactional execution is referred to as the
outermost XBEGIN instruction. The instruction also specifies a relative offset to compute the address of the fallback
code path following a transactional abort. (Use of the 16-bit operand size does not cause this address to be trun-
cated to 16 bits, unlike a near jump to a relative offset.)
On an RTM abort, the logical processor discards all architectural register and memory updates performed during
the RTM execution and restores architectural state to that corresponding to the outermost XBEGIN instruction. The
fallback address following an abort is computed from the outermost XBEGIN instruction.
Operation
XBEGIN
IF RTM_NEST_COUNT < MAX_RTM_NEST_COUNT
    THEN
        RTM_NEST_COUNT++
        IF RTM_NEST_COUNT = 1 THEN
            IF 64-bit Mode
                THEN
                    IF OperandSize = 16
                        THEN fallbackRIP ← RIP + SignExtend64(rel16);
                        ELSE fallbackRIP ← RIP + SignExtend64(rel32);
                    FI;
                    IF fallbackRIP is not canonical
                        THEN #GP(0);
                    FI;
                ELSE
                    IF OperandSize = 16
                        THEN fallbackEIP ← EIP + SignExtend32(rel16);
                        ELSE fallbackEIP ← EIP + rel32;
                    FI;
                    IF fallbackEIP outside code segment limit
                        THEN #GP(0);
                    FI;
            FI;
            RTM_ACTIVE ← 1
            Enter RTM Execution (* record register state, start tracking memory state *)
        FI; (* RTM_NEST_COUNT = 1 *)
    ELSE (* RTM_NEST_COUNT = MAX_RTM_NEST_COUNT *)
        GOTO RTM_ABORT_PROCESSING
FI;
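The 64-bit fallback-address computation can be illustrated in Python (the function name is ours; RIP here stands for the address of the instruction following XBEGIN):

```python
def xbegin_fallback(rip, rel32):
    """Sign-extend a 32-bit displacement and add it to RIP, modulo 2^64."""
    if rel32 & 0x8000_0000:        # sign bit set: extend to 64 bits
        rel32 -= 1 << 32
    return (rip + rel32) & 0xFFFF_FFFF_FFFF_FFFF

# A negative displacement reaches an abort handler placed before the XBEGIN.
print(hex(xbegin_fallback(0x401000, 0xFFFF_FF00)))  # 0x400f00
```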
Flags Affected
None
Description
Performs a full or partial save of processor state components to the XSAVE area located at the memory address
specified by the destination operand. The implicit EDX:EAX register pair specifies a 64-bit instruction mask. The
specific state components saved correspond to the bits set in the requested-feature bitmap (RFBM), the logical-
AND of EDX:EAX and the logical-OR of XCR0 with the IA32_XSS MSR. XSAVES may be executed only if CPL = 0.
The format of the XSAVE area is detailed in Section 13.4, “XSAVE Area,” of Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 1. Like FXRSTOR and FXSAVE, the memory format used for x87 state depends
on a REX.W prefix; see Section 13.5.1, “x87 State” of Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 1.
Section 13.11, “Operation of XSAVES,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume
1 provides a detailed description of the operation of the XSAVES instruction. The following items provide a high-
level outline:
• Execution of XSAVES is similar to that of XSAVEC. XSAVES differs from XSAVEC in that it can save state
components corresponding to bits set in the IA32_XSS MSR and that it may use the modified optimization.
• XSAVES saves state component i only if RFBM[i] = 1 and XINUSE[i] = 1 (see note 1 below). (XINUSE is a bitmap by which the processor tracks the status of various state components. See Section 13.6, “Processor Tracking of XSAVE-Managed State” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.) Even if both bits are 1, XSAVES may optimize and not save state component i if (1) state component i has not been modified since the last execution of XRSTOR or XRSTORS; and (2) this execution of XSAVES corresponds to that last execution of XRSTOR or XRSTORS as determined by XRSTOR_INFO (see the Operation section below).
• XSAVES does not modify bytes 511:464 of the legacy region of the XSAVE area (see Section 13.4.1, “Legacy
Region of an XSAVE Area” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1).
• XSAVES writes the logical AND of RFBM and XINUSE to the XSTATE_BV field of the XSAVE header (see note 2 below). (See Section 13.4.2, “XSAVE Header” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.)
XSAVES sets bit 63 of the XCOMP_BV field and sets bits 62:0 of that field to RFBM[62:0]. XSAVES does not
write to any parts of the XSAVE header other than the XSTATE_BV and XCOMP_BV fields.
• XSAVES always uses the compacted format of the extended region of the XSAVE area (see Section 13.4.3,
“Extended Region of an XSAVE Area” of Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 1).
Use of a destination operand not aligned to 64-byte boundary (in either 64-bit or 32-bit modes) results in a
general-protection (#GP) exception. In 64-bit mode, the upper 32 bits of RDX and RAX are ignored.
1. There is an exception for state component 1 (SSE). MXCSR is part of SSE state, but XINUSE[1] may be 0 even if MXCSR does not have its initial value of 1F80H. In this case, the init optimization does not apply and XSAVES will save SSE state as long as RFBM[1] = 1 and the modified optimization is not being applied.
2. There is an exception for state component 1 (SSE). MXCSR is part of SSE state, but XINUSE[1] may be 0 even if MXCSR does not have its initial value of 1F80H. In this case, XSAVES sets XSTATE_BV[1] to 1 as long as RFBM[1] = 1.
See Section 13.6, “Processor Tracking of XSAVE-Managed State,” of Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1 for discussion of the bitmap XMODIFIED and of the quantity XRSTOR_INFO.
Operation
IF TO_BE_SAVED[0] = 1
    THEN store x87 state into legacy region of XSAVE area;
FI;
IF TO_BE_SAVED[1] = 1
    THEN store SSE state into legacy region of XSAVE area; // this step saves the XMM registers, MXCSR, and MXCSR_MASK
FI;
NEXT_FEATURE_OFFSET = 576; // Legacy area and XSAVE header consume 576 bytes
FOR i ← 2 TO 62
    IF RFBM[i] = 1
        THEN
            IF TO_BE_SAVED[i]
                THEN
                    save XSAVE state component i at offset NEXT_FEATURE_OFFSET from base of XSAVE area;
                    IF i = 8 // state component 8 is for PT state
                        THEN IA32_RTIT_CTL.TraceEn[bit 0] ← 0;
                    FI;
            FI;
            NEXT_FEATURE_OFFSET = NEXT_FEATURE_OFFSET + n (n enumerated by CPUID(EAX=0DH,ECX=i):EAX);
    FI;
ENDFOR;
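The compacted-format layout implied by the loop above can be sketched in Python (illustrative; the component sizes passed in are hypothetical stand-ins for the values CPUID.(EAX=0DH, ECX=i):EAX would report):

```python
def compacted_offsets(rfbm, sizes):
    """Compacted format: components 2..62 whose RFBM bits are set are laid
    out back to back after the 576-byte legacy region + XSAVE header."""
    offsets = {}
    next_off = 576  # legacy area (512 bytes) + XSAVE header (64 bytes)
    for i in range(2, 63):
        if (rfbm >> i) & 1:
            offsets[i] = next_off
            next_off += sizes.get(i, 0)  # size enumerated by CPUID
    return offsets

# Hypothetical sizes for state components 2 and 3.
print(compacted_offsets(0b1100, {2: 256, 3: 64}))  # {2: 576, 3: 832}
```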
Flags Affected
None.
Description
The XTEST instruction queries the transactional execution status. If the instruction executes inside a transaction-
ally executing RTM region or a transactionally executing HLE region, then the ZF flag is cleared, else it is set.
Operation
XTEST
IF (RTM_ACTIVE = 1 OR HLE_ACTIVE = 1)
    THEN ZF ← 0
    ELSE ZF ← 1
FI;
Flags Affected
The ZF flag is cleared if the instruction is executed transactionally; otherwise it is set to 1. The CF, OF, SF, PF, and AF flags are cleared.
Other Exceptions
#UD If CPUID.(EAX=7, ECX=0):EBX.HLE[bit 4] = 0 and CPUID.(EAX=7, ECX=0):EBX.RTM[bit 11] = 0.
If the LOCK prefix is used.
The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part
1 (order number 253668), the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B:
System Programming Guide, Part 2 (order number 253669), the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3C: System Programming Guide, Part 3 (order number 326019), and the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3D: System Programming Guide, Part 4 (order number 332831) are part of a set that describes the architecture and programming environment of Intel 64 and IA-32 architecture processors. The other volumes in this set are:
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture (order number
253665).
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B, 2C & 2D: Instruction Set
Reference (order numbers 253666, 253667, 326018 and 334569).
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4: Model-Specific Registers
(order number 335592).
The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, describes the basic architecture
and programming environment of Intel 64 and IA-32 processors. The Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volumes 2A, 2B, 2C & 2D, describe the instruction set of the processor and the opcode struc-
ture. These volumes apply to application programmers and to programmers who write operating systems or exec-
utives. The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A, 3B, 3C & 3D, describe
the operating-system support environment of Intel 64 and IA-32 processors. These volumes target operating-
system and BIOS designers. In addition, Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume
3B, and Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C address the programming
environment for classes of software that host operating systems. The Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 4, describes the model-specific registers of Intel 64 and IA-32 processors.
ABOUT THIS MANUAL
The Intel® Xeon® Processor Scalable Family, Intel® Xeon® processor E3-1500m v5 product family and 6th gener-
ation Intel® Core™ processors are based on the Skylake microarchitecture and support Intel 64 architecture.
The 7th generation Intel® Core™ processors are based on the Kaby Lake microarchitecture and support Intel 64
architecture.
The Intel® Atom™ processor C series, the Intel® Atom™ processor X series, the Intel® Pentium® processor J
series, the Intel® Celeron® processor J series, and the Intel® Celeron® processor N series are based on the Gold-
mont microarchitecture.
The Intel® Xeon Phi™ Processor 3200, 5200, 7200 Series is based on the Knights Landing microarchitecture and
supports Intel 64 architecture.
The Intel® Pentium® Silver processor series, the Intel® Celeron® processor J series, and the Intel® Celeron®
processor N series are based on the Goldmont Plus microarchitecture.
The 8th generation Intel® Core™ processors, 9th generation Intel® Core™ processors, and Intel® Xeon® E proces-
sors are based on the Coffee Lake microarchitecture and support Intel 64 architecture.
The Intel® Xeon Phi™ Processor 7215, 7285, 7295 Series is based on the Knights Mill microarchitecture and
supports Intel 64 architecture.
The 2nd generation Intel® Xeon® Processor Scalable Family is based on the Cascade Lake product and supports
Intel 64 architecture.
The 10th generation Intel® Core™ processors are based on the Ice Lake microarchitecture and support Intel 64
architecture.
IA-32 architecture is the instruction set architecture and programming environment for Intel's 32-bit microproces-
sors. Intel® 64 architecture is the instruction set architecture and programming environment which is the superset
of Intel’s 32-bit and 64-bit architectures. It is compatible with the IA-32 architecture.
1. Model-Specific Registers have been moved out of this volume and into a separate volume: Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 4.
Chapter 8 — Multiple-Processor Management. Describes the instructions and flags that support multiple
processors with shared memory, memory ordering, and Intel® Hyper-Threading Technology. Includes MP initializa-
tion for P6 family processors and gives an example of how to use the MP protocol to boot P6 family processors in
an MP system.
Chapter 9 — Processor Management and Initialization. Defines the state of an Intel 64 or IA-32 processor after reset initialization. This chapter also explains how to set up an Intel 64 or IA-32 processor for real-address mode operation and protected-mode operation, and how to switch between modes.
Chapter 10 — Advanced Programmable Interrupt Controller (APIC). Describes the programming interface
to the local APIC and gives an overview of the interface between the local APIC and the I/O APIC. Includes APIC bus
message formats and describes the message formats for messages transmitted on the APIC bus for P6 family and
Pentium processors.
Chapter 11 — Memory Cache Control. Describes the general concept of caching and the caching mechanisms
supported by the Intel 64 or IA-32 architectures. This chapter also describes the memory type range registers
(MTRRs) and how they can be used to map memory types of physical memory. Information on using the new cache
control and memory streaming instructions introduced with the Pentium III, Pentium 4, and Intel Xeon processors
is also given.
Chapter 12 — Intel® MMX™ Technology System Programming. Describes those aspects of the Intel® MMX™
technology that must be handled and considered at the system programming level, including: task switching,
exception handling, and compatibility with existing system environments.
Chapter 13 — System Programming For Instruction Set Extensions And Processor Extended States.
Describes the operating system requirements to support SSE/SSE2/SSE3/SSSE3/SSE4 extensions, including task
switching, exception handling, and compatibility with existing system environments. The latter part of this chapter
describes the extensible framework of operating system requirements to support processor extended states.
Processor extended state may be required by instruction set extensions beyond those of
SSE/SSE2/SSE3/SSSE3/SSE4 extensions.
Chapter 14 — Power and Thermal Management. Describes facilities of Intel 64 and IA-32 architecture used for
power management and thermal monitoring.
Chapter 15 — Machine-Check Architecture. Describes the machine-check architecture and machine-check
exception mechanism found in the Pentium 4, Intel Xeon, and P6 family processors. Additionally, a signaling
mechanism by which software can respond to hardware-corrected machine-check errors is covered.
Chapter 16 — Interpreting Machine-Check Error Codes. Gives an example of how to interpret the error codes
for a machine-check error that occurred on a P6 family processor.
Chapter 17 — Debug, Branch Profile, TSC, and Resource Monitoring Features. Describes the debugging
registers and other debug mechanisms provided in Intel 64 or IA-32 processors. This chapter also describes the
time-stamp counter.
Chapter 18 — Performance Monitoring. Describes the Intel 64 and IA-32 architectures’ facilities for monitoring
performance.
Chapter 19 — Performance-Monitoring Events. Lists architectural performance events. Non-architectural
performance events (i.e., model-specific events) are listed for each generation of microarchitecture.
Chapter 20 — 8086 Emulation. Describes the real-address and virtual-8086 modes of the IA-32 architecture.
Chapter 21 — Mixing 16-Bit and 32-Bit Code. Describes how to mix 16-bit and 32-bit code modules within the
same program or task.
Chapter 22 — IA-32 Architecture Compatibility. Describes architectural compatibility among IA-32 proces-
sors.
Chapter 23 — Introduction to Virtual Machine Extensions. Describes the basic elements of virtual machine
architecture and the virtual machine extensions for Intel 64 and IA-32 Architectures.
Chapter 24 — Virtual Machine Control Structures. Describes components that manage VMX operation. These
include the working-VMCS pointer and the controlling-VMCS pointer.
Chapter 25 — VMX Non-Root Operation. Describes VMX non-root operation. Processor operation in VMX
non-root mode can be restricted programmatically such that certain operations, events, or conditions
Vol. 3A 1-5
ABOUT THIS MANUAL
can cause the processor to transfer control from the guest (running in VMX non-root mode) to the monitor software
(running in VMX root mode).
Chapter 26 — VM Entries. Describes VM entries. A VM entry transitions the processor from the VMM running in
VMX root operation to a VM running in VMX non-root operation. VM entries are performed by executing the
VMLAUNCH or VMRESUME instruction.
Chapter 27 — VM Exits. Describes VM exits. Certain events, operations, or situations while the processor is in VMX
non-root operation may cause VM-exit transitions. In addition, VM exits can occur on failed VM entries.
Chapter 28 — VMX Support for Address Translation. Describes virtual-machine extensions that support
address translation and the virtualization of physical memory.
Chapter 29 — APIC Virtualization and Virtual Interrupts. Describes the VMCS including controls that enable
the virtualization of interrupts and the Advanced Programmable Interrupt Controller (APIC).
Chapter 30 — VMX Instruction Reference. Describes the virtual-machine extensions (VMX). VMX is intended
for a system executive to support virtualization of processor hardware and a system software layer acting as a host
to multiple guest software environments.
Chapter 31 — Virtual-Machine Monitor Programming Considerations. Describes programming consider-
ations for VMMs. VMMs manage virtual machines (VMs).
Chapter 32 — Virtualization of System Resources. Describes the virtualization of the system resources. These
include: debugging facilities, address translation, physical memory, and microcode update facilities.
Chapter 33 — Handling Boundary Conditions in a Virtual Machine Monitor. Describes what a VMM must
consider when handling exceptions, interrupts, error conditions, and transitions between activity states.
Chapter 34 — System Management Mode. Describes Intel 64 and IA-32 architectures’ system management
mode (SMM) facilities.
Chapter 35 — Intel® Processor Trace. Describes details of Intel® Processor Trace.
Chapter 36 — Introduction to Intel® Software Guard Extensions. Provides an overview of the Intel® Soft-
ware Guard Extensions (Intel® SGX) set of instructions.
Chapter 37 — Enclave Access Control and Data Structures. Describes Enclave Access Control procedures and
defines various Intel SGX data structures.
Chapter 38 — Enclave Operation. Describes enclave creation and initialization, adding pages and measuring an
enclave, and enclave entry and exit.
Chapter 39 — Enclave Exiting Events. Describes enclave-exiting events (EEE) and asynchronous enclave exit
(AEX).
Chapter 40 — SGX Instruction References. Describes the supervisor and user level instructions provided by
Intel SGX.
Chapter 41 — Intel® SGX Interactions with IA32 and Intel® 64 Architecture. Describes the Intel SGX
collection of enclave instructions for creating protected execution environments on processors supporting IA32 and
Intel 64 architectures.
Chapter 42 — Enclave Code Debug and Profiling. Describes enclave code debug processes and options.
Appendix A — VMX Capability Reporting Facility. Describes the VMX capability MSRs. Support for specific VMX
features is determined by reading capability MSRs.
Appendix B — Field Encoding in VMCS. Enumerates all fields in the VMCS and their encodings. Fields are
grouped by width (16-bit, 32-bit, etc.) and type (guest-state, host-state, etc.).
Appendix C — VM Basic Exit Reasons. Describes the 32-bit fields that encode reasons for a VM exit. Examples
of exit reasons include, but are not limited to: software interrupts, processor exceptions, software traps, NMIs,
external interrupts, and triple faults.
NOTE
Avoid any software dependence upon the state of reserved bits in Intel 64 and IA-32 registers.
Software that depends upon the values of reserved register bits relies upon the unspecified
manner in which the processor handles these bits. Programs that depend upon reserved values risk
incompatibility with future processors.
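The guidance above amounts to a read-modify-write discipline: to change a defined bit, software reads the register, alters only the bits it owns, and writes the value back so reserved bits pass through unchanged. A minimal sketch; the register image and bit position here are purely illustrative, not from any real register map:

```python
# Sketch: set a defined flag in a control-register image without
# disturbing reserved bits (hypothetical bit position for illustration).

DEFINED_FLAG = 1 << 5   # hypothetical architecturally defined bit

def set_flag(reg_value: int) -> int:
    """Read-modify-write: touch only the defined bit; reserved bits
    pass through unchanged, whatever the processor returned for them."""
    return reg_value | DEFINED_FLAG

# Reserved bits (here, bits 30:28) are preserved verbatim:
old = 0x7000_0000           # processor happened to report these set
new = set_flag(old)
assert new == 0x7000_0020   # defined bit set; reserved bits untouched
```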
[Figure: byte and bit order in a data structure. Bit offsets 31:0 run right to left across each doubleword; byte
offsets (0, 4, 8, ..., 28) increase from the lowest address to the highest address, with Byte 0 of each doubleword
at bit offset 0 and Byte 3 at bit offset 24.]
DS:FF79H
The following segment address identifies an instruction address in the code segment. The CS register points to the
code segment and the EIP register contains the address of the instruction.
CS:EIP
Example CPUID name: CPUID.01H:EDX.SSE[bit 25] = 1
Example CR name: CR4.OSFXSR[bit 9] = 1
Example MSR name: IA32_MISC_ENABLE.ENABLEFOPCODE[bit 2] = 1
Figure 1-2. Syntax for CPUID, CR, and MSR Data Presentation
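The notation above can be read mechanically: each name identifies a register value and a bit position within it. A minimal helper illustrating the idea (the sample EDX value is hypothetical, not from real hardware):

```python
def bit(value: int, pos: int) -> int:
    """Extract a single bit from a register value,
    e.g. CPUID.01H:EDX.SSE[bit 25]."""
    return (value >> pos) & 1

# Suppose CPUID leaf 01H returned this EDX value (illustrative only):
edx = 1 << 25
assert bit(edx, 25) == 1   # SSE reported as supported
```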
1.3.7 Exceptions
An exception is an event that typically occurs when an instruction causes an error. For example, an attempt to
divide by zero generates an exception. However, some exceptions, such as breakpoints, occur under other condi-
tions. Some types of exceptions may provide error codes. An error code reports additional information about the
error. An example of the notation used to show an exception and error code is shown below:
#PF(fault code)
This example refers to a page-fault exception under conditions where an error code naming a type of fault is
reported. Under some conditions, exceptions which produce error codes may not be able to report an accurate
code. In this case, the error code is zero, as shown below for a general-protection exception:
#GP(0)
11. Updates to Chapter 4, Volume 3A
Change bars and green text show changes to Chapter 4 of the Intel® 64 and IA-32 Architectures Software Devel-
oper’s Manual, Volume 3A: System Programming Guide, Part 1.
------------------------------------------------------------------------------------------
Changes to this chapter: Added Control-flow Enforcement Technology details where applicable.
Chapter 3 explains how segmentation converts logical addresses to linear addresses. Paging (or linear-address
translation) is the process of translating linear addresses so that they can be used to access memory or I/O
devices. Paging translates each linear address to a physical address and determines, for each translation, what
accesses to the linear address are allowed (the address’s access rights) and the type of caching used for such
accesses (the address’s memory type).
Intel-64 processors support three different paging modes. These modes are identified and defined in Section 4.1.
Section 4.2 gives an overview of the translation mechanism that is used in all modes. Section 4.3, Section 4.4, and
Section 4.5 discuss the three paging modes in detail.
Section 4.6 details how paging determines and uses access rights. Section 4.7 discusses exceptions that may be
generated by paging (page-fault exceptions). Section 4.8 considers data which the processor writes in response to
linear-address accesses (accessed and dirty flags).
Section 4.9 describes how paging determines the memory types used for accesses to linear addresses. Section
4.10 provides details of how a processor may cache information about linear-address translation. Section 4.11
outlines interactions between paging and certain VMX features. Section 4.12 gives an overview of how paging can
be used to implement virtual memory.
PAGING
• If CR0.PG = 1, CR4.PAE = 1, and IA32_EFER.LME = 0, PAE paging is used. PAE paging is detailed in Section
4.4. PAE paging uses CR0.WP, CR4.PGE, CR4.SMEP, CR4.SMAP, CR4.CET, and IA32_EFER.NXE as described in
Section 4.1.3 and Section 4.6.
• If CR0.PG = 1, CR4.PAE = 1, and IA32_EFER.LME = 1, 4-level paging1 is used.2 4-level paging is detailed in
Section 4.5. 4-level paging uses CR0.WP, CR4.PGE, CR4.PCIDE, CR4.SMEP, CR4.SMAP, CR4.PKE, CR4.CET, and
IA32_EFER.NXE as described in Section 4.1.3 and Section 4.6. 4-level paging is available only on processors
that support the Intel 64 architecture.
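The mode-selection rules above can be condensed into a small classifier. A sketch (not production code) that mirrors the CR0.PG / CR4.PAE / IA32_EFER.LME case analysis:

```python
def paging_mode(pg: int, pae: int, lme: int) -> str:
    """Classify the paging mode from CR0.PG, CR4.PAE, and IA32_EFER.LME,
    following the rules in this section."""
    if pg == 0:
        return "none"
    if pae == 0:
        return "32-bit"      # LME must be 0 here; the processor enforces it
    return "4-level" if lme else "PAE"

assert paging_mode(0, 0, 0) == "none"
assert paging_mode(1, 0, 0) == "32-bit"
assert paging_mode(1, 1, 0) == "PAE"
assert paging_mode(1, 1, 1) == "4-level"
```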
The three paging modes differ with regard to the following details:
• Linear-address width. The size of the linear addresses that can be translated.
• Physical-address width. The size of the physical addresses produced by paging.
• Page size. The granularity at which linear addresses are translated. Linear addresses on the same page are
translated to corresponding physical addresses on the same page.
• Support for execute-disable access rights. In some paging modes, software can be prevented from fetching
instructions from pages that are otherwise readable.
• Support for PCIDs. With 4-level paging, software can enable a facility by which a logical processor caches
information for multiple linear-address spaces. The processor may retain cached information when software
switches between different linear-address spaces.
• Support for protection keys. With 4-level paging, software can enable a facility by which each linear address is
associated with a protection key. Software can use a new control register to determine, for each protection
key, how software can access linear addresses associated with that protection key.
Table 4-1 illustrates the principal differences between the three paging modes.
Table 4-1. Properties of Different Paging Modes

Paging Mode | PG in CR0 | PAE in CR4 | LME in IA32_EFER | Lin.-Addr. Width | Phys.-Addr. Width (Note 1) | Page Sizes | Supports Execute-Disable? | Supports PCIDs and Protection Keys?
32-bit | 1 | 0 | 0 (Note 2) | 32 | Up to 40 (Note 3) | 4 KB, 4 MB (Note 4) | No | No
PAE | 1 | 1 | 0 | 32 | Up to 52 | 4 KB, 2 MB | Yes (Note 5) | No
4-level | 1 | 1 | 1 | 48 | Up to 52 | 4 KB, 2 MB, 1 GB (Note 6) | Yes (Note 5) | Yes (Note 7)
NOTES:
1. The physical-address width is always bounded by MAXPHYADDR; see Section 4.1.4.
2. The processor ensures that IA32_EFER.LME is 0 if CR0.PG = 1 and CR4.PAE = 0.
3. 32-bit paging supports physical-address widths of more than 32 bits only for 4-MByte pages and only if the PSE-36 mechanism is
supported; see Section 4.1.4 and Section 4.3.
4. 4-MByte pages are used with 32-bit paging only if CR4.PSE = 1; see Section 4.3.
5. Execute-disable access rights are applied only if IA32_EFER.NXE = 1; see Section 4.6.
6. Not all processors that support 4-level paging support 1-GByte pages; see Section 4.1.4.
7. PCIDs are used only if CR4.PCIDE = 1; see Section 4.10.1. Protection keys are used only if certain conditions hold; see Section 4.6.2.
1. Earlier versions of this manual used the term “IA-32e paging” to identify 4-level paging.
2. The LMA flag in the IA32_EFER MSR (bit 10) is a status bit that indicates whether the logical processor is in IA-32e mode (and thus
using 4-level paging). The processor always sets IA32_EFER.LMA to CR0.PG & IA32_EFER.LME. Software cannot directly modify
IA32_EFER.LMA; an execution of WRMSR to the IA32_EFER MSR ignores bit 10 of its source operand.
Because they are used only if IA32_EFER.LME = 0, 32-bit paging and PAE paging are used only in legacy protected
mode. Because legacy protected mode cannot produce linear addresses larger than 32 bits, 32-bit paging and PAE
paging translate 32-bit linear addresses.
Because it is used only if IA32_EFER.LME = 1, 4-level paging is used only in IA-32e mode. (In fact, it is the use of
4-level paging that defines IA-32e mode.) IA-32e mode has two sub-modes:
• Compatibility mode. This mode uses only 32-bit linear addresses. 4-level paging treats bits 47:32 of such an
address as all 0.
• 64-bit mode. While this mode produces 64-bit linear addresses, the processor ensures that bits 63:47 of such
an address are identical.1 4-level paging does not use bits 63:48 of such addresses.
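The canonical-address requirement for 64-bit mode can be checked with a simple bit test: the top 17 bits of the address (bits 63:47) must all be 0 or all be 1. A sketch assuming the 48-bit implemented linear-address width described here:

```python
def is_canonical(addr: int) -> bool:
    """True if bits 63:47 of a 64-bit linear address are identical,
    i.e. the address sign-extends bit 47 (48-bit implemented width)."""
    top = addr >> 47                 # bits 63:47, a 17-bit field
    return top == 0 or top == 0x1FFFF

assert is_canonical(0x0000_7FFF_FFFF_FFFF)        # highest low-half address
assert is_canonical(0xFFFF_8000_0000_0000)        # lowest high-half address
assert not is_canonical(0x0000_8000_0000_0000)    # non-canonical
```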
[Figure: enabling and changing paging modes. A state diagram showing transitions among no paging (PG = 0),
32-bit paging, PAE paging, and 4-level paging, driven by setting and clearing CR0.PG, CR4.PAE, and
IA32_EFER.LME; disallowed attempts (e.g., setting or clearing LME while paging is enabled) cause #GP.]
• IA32_EFER.LME cannot be modified while paging is enabled (CR0.PG = 1). Attempts to do so using WRMSR
cause a general-protection exception (#GP(0)).
1. Such an address is called canonical. Use of a non-canonical linear address in 64-bit mode produces a general-protection exception
(#GP(0)); the processor does not attempt to translate non-canonical linear addresses using 4-level paging.
• Paging cannot be enabled (by setting CR0.PG to 1) while CR4.PAE = 0 and IA32_EFER.LME = 1. Attempts to do
so using MOV to CR0 cause a general-protection exception (#GP(0)).
• CR4.PAE cannot be cleared while 4-level paging is active (CR0.PG = 1 and IA32_EFER.LME = 1). Attempts to
do so using MOV to CR4 cause a general-protection exception (#GP(0)).
• Regardless of the current paging mode, software can disable paging by clearing CR0.PG with MOV to CR0.1
• Software can make transitions between 32-bit paging and PAE paging by changing the value of CR4.PAE with
MOV to CR4.
• Software cannot make transitions directly between 4-level paging and either of the other two paging modes. It
must first disable paging (by clearing CR0.PG with MOV to CR0), then set CR4.PAE and IA32_EFER.LME to the
desired values (with MOV to CR4 and WRMSR), and then re-enable paging (by setting CR0.PG with MOV to
CR0). As noted earlier, an attempt to clear either CR4.PAE or IA32_EFER.LME causes a general-protection
exception (#GP(0)).
• VMX transitions allow transitions between paging modes that are not possible using MOV to CR or WRMSR. This
is because VMX transitions can load CR0, CR4, and IA32_EFER in one operation. See Section 4.11.1.
1. If CR4.PCIDE = 1, an attempt to clear CR0.PG causes a general-protection exception (#GP); software should clear CR4.PCIDE before
attempting to disable paging.
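The #GP conditions listed above can be modeled as a small rule check. A toy sketch covering only two of the rules (changing LME while paging is enabled, and enabling paging with CR4.PAE = 0 and IA32_EFER.LME = 1); it is not a full emulation of CR writes:

```python
def check_paging_change(pg, pae, lme, new_pg=None, new_lme=None):
    """Return '#GP' if a proposed control-register change violates the
    rules in this section, else 'ok'. A toy model for illustration."""
    if new_lme is not None and new_lme != lme and pg == 1:
        return "#GP"     # IA32_EFER.LME cannot change while CR0.PG = 1
    if new_pg == 1 and pae == 0 and lme == 1:
        return "#GP"     # cannot enable paging with CR4.PAE = 0, LME = 1
    return "ok"

assert check_paging_change(pg=1, pae=1, lme=0, new_lme=1) == "#GP"
assert check_paging_change(pg=0, pae=0, lme=1, new_pg=1) == "#GP"
assert check_paging_change(pg=0, pae=1, lme=1, new_pg=1) == "ok"
```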
shadow-stack pages. Section 4.6 explains how access rights are determined for these accesses and pages. (The
processor allows CR4.CET to be set only if CR0.WP is also set.)
IA32_EFER.NXE enables execute-disable access rights for PAE paging and 4-level paging. If IA32_EFER.NXE = 1,
instruction fetches can be prevented from specified linear addresses (even if data reads from the addresses are
allowed). Section 4.6 explains how access rights are determined. (IA32_EFER.NXE has no effect with 32-bit
paging. Software that wants to use this feature to limit instruction fetches from readable pages must use either PAE
paging or 4-level paging.)
• CPUID.80000008H:EAX[7:0] reports the physical-address width supported by the processor. (For processors
that do not support CPUID function 80000008H, the width is generally 36 if CPUID.01H:EDX.PAE [bit 6] = 1
and 32 otherwise.) This width is referred to as MAXPHYADDR. MAXPHYADDR is at most 52.
• CPUID.80000008H:EAX[15:8] reports the linear-address width supported by the processor. Generally, this
value is 48 if CPUID.80000001H:EDX.LM [bit 29] = 1 and 32 otherwise. (Processors that do not support CPUID
function 80000008H support a linear-address width of 32.)
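The MAXPHYADDR derivation above can be sketched as a pure function of the reported CPUID values. The sample EAX value below is illustrative only (EAX[15:8] = 48 linear, EAX[7:0] = 40 physical):

```python
def maxphyaddr(leaf_80000008_eax=None, pae_supported=True):
    """Derive MAXPHYADDR per the rules above: use CPUID.80000008H:EAX[7:0]
    when the leaf exists; otherwise generally 36 with PAE, 32 without."""
    if leaf_80000008_eax is not None:
        return leaf_80000008_eax & 0xFF
    return 36 if pae_supported else 32

assert maxphyaddr(0x00003028) == 40    # EAX[7:0] = 0x28: 40-bit physical
assert maxphyaddr(None, pae_supported=True) == 36
assert maxphyaddr(None, pae_supported=False) == 32
```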
• If more than 12 bits remain in the linear address, bit 7 (PS — page size) of the current paging-structure entry
is consulted. If the bit is 0, the entry references another paging structure; if the bit is 1, the entry maps a page.
• If only 12 bits remain in the linear address, the current paging-structure entry always maps a page (bit 7 is
used for other purposes).
If a paging-structure entry maps a page when more than 12 bits remain in the linear address, the entry identifies
a page frame larger than 4 KBytes. For example, 32-bit paging uses the upper 10 bits of a linear address to locate
the first paging-structure entry; 22 bits remain. If that entry maps a page, the page frame is 2^22 Bytes = 4 MBytes.
32-bit paging supports 4-MByte pages if CR4.PSE = 1. PAE paging and 4-level paging support 2-MByte pages
(regardless of the value of CR4.PSE). 4-level paging may support 1-GByte pages (see Section 4.1.4).
Paging structures are given different names based on their uses in the translation process. Table 4-2 gives the
names of the different paging structures. It also provides, for each structure, the source of the physical address
used to locate it (CR3 or a different paging-structure entry); the bits in the linear address used to select an entry
from the structure; and details of whether and how such an entry can map a page.
Table 4-2. Paging Structures in the Different Paging Modes

Paging Structure | Entry Name | Paging Mode | Physical Address of Structure | Bits Selecting Entry | Page Mapping
Page-directory-pointer table | PDPTE | 32-bit | N/A | N/A | N/A
Page-directory-pointer table | PDPTE | PAE | CR3 | 31:30 | N/A (PS must be 0)
Page-directory-pointer table | PDPTE | 4-level | PML4E | 38:30 | 1-GByte page if PS=1 (Note 1)
NOTES:
1. Not all processors allow the PS flag to be 1 in PDPTEs; see Section 4.1.4 for how to determine whether 1-GByte pages are supported.
2. 32-bit paging ignores the PS flag in a PDE (and uses the entry to reference a page table) unless CR4.PSE = 1. Not all processors allow
CR4.PSE to be 1; see Section 4.1.4 for how to determine whether 4-MByte pages are supported with 32-bit paging.
1. Bits in the range 39:32 are 0 in any physical address used by 32-bit paging except those used to map 4-MByte pages. If the proces-
sor does not support the PSE-36 mechanism, this is true also for physical addresses used to map 4-MByte pages. If the processor
does support the PSE-36 mechanism and MAXPHYADDR < 40, bits in the range 39:MAXPHYADDR are 0 in any physical address used
to map a 4-MByte page. (The corresponding bits are reserved in PDEs.) See Section 4.1.4 for how to determine MAXPHYADDR and
whether the PSE-36 mechanism is supported.
32-bit paging may map linear addresses to either 4-KByte pages or 4-MByte pages. Figure 4-2 illustrates the trans-
lation process when it uses a 4-KByte page; Figure 4-3 covers the case of a 4-MByte page. The following items
describe the 32-bit paging process in more detail, as well as how the page size is determined:
• A 4-KByte naturally aligned page directory is located at the physical address specified in bits 31:12 of CR3 (see
Table 4-3). A page directory comprises 1024 32-bit entries (PDEs). A PDE is selected using the physical address
defined as follows:
— Bits 39:32 are all 0.
— Bits 31:12 are from CR3.
— Bits 11:2 are bits 31:22 of the linear address.
— Bits 1:0 are 0.
Because a PDE is identified using bits 31:22 of the linear address, it controls access to a 4-MByte region of the
linear-address space. Use of the PDE depends on CR4.PSE and the PDE’s PS flag (bit 7):
• If CR4.PSE = 1 and the PDE’s PS flag is 1, the PDE maps a 4-MByte page (see Table 4-4). The final physical
address is computed as follows:
— Bits 39:32 are bits 20:13 of the PDE.
— Bits 31:22 are bits 31:22 of the PDE.1
— Bits 21:0 are from the original linear address.
• If CR4.PSE = 0 or the PDE’s PS flag is 0, a 4-KByte naturally aligned page table is located at the physical
address specified in bits 31:12 of the PDE (see Table 4-5). A page table comprises 1024 32-bit entries (PTEs).
A PTE is selected using the physical address defined as follows:
— Bits 39:32 are all 0.
— Bits 31:12 are from the PDE.
— Bits 11:2 are bits 21:12 of the linear address.
— Bits 1:0 are 0.
• Because a PTE is identified using bits 31:12 of the linear address, every PTE maps a 4-KByte page (see
Table 4-6). The final physical address is computed as follows:
— Bits 39:32 are all 0.
— Bits 31:12 are from the PTE.
— Bits 11:0 are from the original linear address.
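The address computations above can be sketched as a page walk over a modeled physical memory. This is an illustrative simplification: `read_u32` stands in for a physical memory read, and present-bit and reserved-bit checks are omitted:

```python
def walk_32bit(linear, cr3, read_u32, cr4_pse=False):
    """Translate a 32-bit linear address to a physical address,
    following the PDE/PTE steps above (checks omitted for brevity)."""
    pde_addr = (cr3 & 0xFFFFF000) | ((linear >> 22) << 2)
    pde = read_u32(pde_addr)
    if cr4_pse and (pde & (1 << 7)):             # PS = 1: 4-MByte page
        high = ((pde >> 13) & 0xFF) << 32        # bits 39:32 from PDE 20:13
        return high | (pde & 0xFFC00000) | (linear & 0x3FFFFF)
    pte_addr = (pde & 0xFFFFF000) | (((linear >> 12) & 0x3FF) << 2)
    pte = read_u32(pte_addr)
    return (pte & 0xFFFFF000) | (linear & 0xFFF)

# Tiny fake physical memory: one PDE at 0x1000, one PTE at 0x2000.
mem = {0x1000: 0x00002003,    # PDE -> page table at 0x2000, P=1, R/W=1
       0x2000: 0x00005003}    # PTE -> 4-KByte frame at 0x5000
assert walk_32bit(0xABC, 0x1000, mem.__getitem__) == 0x5ABC
```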
If a paging-structure entry’s P flag (bit 0) is 0 or if the entry sets any reserved bit, the entry is used neither to refer-
ence another paging-structure entry nor to map a page. There is no translation for a linear address whose transla-
tion would use such a paging-structure entry; a reference to such a linear address causes a page-fault exception
(see Section 4.7).
With 32-bit paging, there are reserved bits only if CR4.PSE = 1:
• If the P flag and the PS flag (bit 7) of a PDE are both 1, the bits reserved depend on MAXPHYADDR, and whether
the PSE-36 mechanism is supported:2
— If the PSE-36 mechanism is not supported, bits 21:13 are reserved.
— If the PSE-36 mechanism is supported, bits 21:(M–19) are reserved, where M is the minimum of 40 and
MAXPHYADDR.
• If the PAT is not supported:3
— If the P flag of a PTE is 1, bit 7 is reserved.
1. The upper bits in the final physical address do not all come from corresponding positions in the PDE; the physical-address bits in the
PDE are not all contiguous.
2. See Section 4.1.4 for how to determine MAXPHYADDR and whether the PSE-36 mechanism is supported.
3. See Section 4.1.4 for how to determine whether the PAT is supported.
— If the P flag and the PS flag of a PDE are both 1, bit 12 is reserved.
(If CR4.PSE = 0, no bits are reserved with 32-bit paging.)
A reference using a linear address that is successfully translated to a physical address is performed only if allowed
by the access rights of the translation; see Section 4.6.
Figure 4-2. Linear-Address Translation to a 4-KByte Page Using 32-Bit Paging (the linear address is split into
Directory [31:22], Table [21:12], and Offset [11:0]; CR3 locates the page directory, the PDE with PS = 0 locates
the page table, and the PTE locates the 4-KByte page)

Figure 4-3. Linear-Address Translation to a 4-MByte Page Using 32-Bit Paging (the linear address is split into
Directory [31:22] and Offset [21:0]; CR3 locates the page directory, and the PDE with PS = 1 supplies the upper
bits of the physical address of the 4-MByte page)
Figure 4-4 gives a summary of the formats of CR3 and the paging-structure entries with 32-bit paging. For the
paging structure entries, it identifies separately the format of entries that map pages, those that reference other
paging structures, and those that do neither because they are “not present”; bit 0 (P) and bit 7 (PS) are high-
lighted because they determine how such an entry is used.
CR3: ignored [2:0] | PWT [3] | PCD [4] | ignored [11:5] | address of page directory [31:12] (Note 1)
PDE, maps 4-MByte page: P = 1 [0] | R/W [1] | U/S [2] | PWT [3] | PCD [4] | A [5] | D [6] | PS = 1 [7] | G [8] | ignored [11:9] | PAT [12] | bits 39:32 of 4-MByte page-frame address [20:13] | reserved, must be 0 [21] (Note 2) | bits 31:22 of 4-MByte page-frame address [31:22]
PDE, references page table: P = 1 [0] | R/W [1] | U/S [2] | PWT [3] | PCD [4] | A [5] | ignored [6] | PS = 0 [7] | ignored [11:8] | address of page table [31:12]
PDE, not present: P = 0 [0] | ignored [31:1]
PTE, maps 4-KByte page: P = 1 [0] | R/W [1] | U/S [2] | PWT [3] | PCD [4] | A [5] | D [6] | PAT [7] | G [8] | ignored [11:9] | address of 4-KByte page frame [31:12]
PTE, not present: P = 0 [0] | ignored [31:1]
Figure 4-4. Formats of CR3 and Paging-Structure Entries with 32-Bit Paging
NOTES:
1. CR3 has 64 bits on processors supporting the Intel-64 architecture. These bits are ignored with 32-bit paging.
2. This example illustrates a processor in which MAXPHYADDR is 36. If this value is larger or smaller, the number of bits reserved in
positions 20:13 of a PDE mapping a 4-MByte page will change.
Table 4-3. Use of CR3 with 32-Bit Paging

Bit Position(s) | Contents
2:0 | Ignored
3 (PWT) | Page-level write-through; indirectly determines the memory type used to access the page directory during linear-address translation (see Section 4.9)
4 (PCD) | Page-level cache disable; indirectly determines the memory type used to access the page directory during linear-address translation (see Section 4.9)
11:5 | Ignored
31:12 | Physical address of the 4-KByte aligned page directory used for linear-address translation
63:32 | Ignored (these bits exist only on processors supporting the Intel-64 architecture)
Table 4-4. Format of a 32-Bit Page-Directory Entry that Maps a 4-MByte Page
Bit Position(s) | Contents
1 (R/W) | Read/write; if 0, writes may not be allowed to the 4-MByte page referenced by this entry (see Section 4.6)
2 (U/S) | User/supervisor; if 0, user-mode accesses are not allowed to the 4-MByte page referenced by this entry (see Section 4.6)
3 (PWT) | Page-level write-through; indirectly determines the memory type used to access the 4-MByte page referenced by this entry (see Section 4.9)
4 (PCD) | Page-level cache disable; indirectly determines the memory type used to access the 4-MByte page referenced by this entry (see Section 4.9)
5 (A) | Accessed; indicates whether software has accessed the 4-MByte page referenced by this entry (see Section 4.8)
6 (D) | Dirty; indicates whether software has written to the 4-MByte page referenced by this entry (see Section 4.8)
7 (PS) | Page size; must be 1 (otherwise, this entry references a page table; see Table 4-5)
8 (G) | Global; if CR4.PGE = 1, determines whether the translation is global (see Section 4.10); ignored otherwise
11:9 | Ignored
12 (PAT) | If the PAT is supported, indirectly determines the memory type used to access the 4-MByte page referenced by this entry (see Section 4.9.2); otherwise, reserved (must be 0) (Note 1)
(M–20):13 | Bits (M–1):32 of physical address of the 4-MByte page referenced by this entry (Note 2)
31:22 | Bits 31:22 of physical address of the 4-MByte page referenced by this entry
NOTES:
1. See Section 4.1.4 for how to determine whether the PAT is supported.
2. If the PSE-36 mechanism is not supported, M is 32, and this row does not apply. If the PSE-36 mechanism is supported, M is the min-
imum of 40 and MAXPHYADDR (this row does not apply if MAXPHYADDR = 32). See Section 4.1.4 for how to determine MAXPHYA-
DDR and whether the PSE-36 mechanism is supported.
Table 4-5. Format of a 32-Bit Page-Directory Entry that References a Page Table
Bit Position(s) | Contents
1 (R/W) | Read/write; if 0, writes may not be allowed to the 4-MByte region controlled by this entry (see Section 4.6)
2 (U/S) | User/supervisor; if 0, user-mode accesses are not allowed to the 4-MByte region controlled by this entry (see Section 4.6)
3 (PWT) | Page-level write-through; indirectly determines the memory type used to access the page table referenced by this entry (see Section 4.9)
4 (PCD) | Page-level cache disable; indirectly determines the memory type used to access the page table referenced by this entry (see Section 4.9)
5 (A) | Accessed; indicates whether this entry has been used for linear-address translation (see Section 4.8)
6 | Ignored
7 (PS) | If CR4.PSE = 1, must be 0 (otherwise, this entry maps a 4-MByte page; see Table 4-4); otherwise, ignored
11:8 | Ignored
31:12 | Physical address of 4-KByte aligned page table referenced by this entry
Table 4-6. Format of a 32-Bit Page-Table Entry that Maps a 4-KByte Page
Bit Position(s) | Contents
1 (R/W) | Read/write; if 0, writes may not be allowed to the 4-KByte page referenced by this entry (see Section 4.6)
2 (U/S) | User/supervisor; if 0, user-mode accesses are not allowed to the 4-KByte page referenced by this entry (see Section 4.6)
3 (PWT) | Page-level write-through; indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 4.9)
4 (PCD) | Page-level cache disable; indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 4.9)
5 (A) | Accessed; indicates whether software has accessed the 4-KByte page referenced by this entry (see Section 4.8)
6 (D) | Dirty; indicates whether software has written to the 4-KByte page referenced by this entry (see Section 4.8)
7 (PAT) | If the PAT is supported, indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 4.9.2); otherwise, reserved (must be 0) (Note 1)
8 (G) | Global; if CR4.PGE = 1, determines whether the translation is global (see Section 4.10); ignored otherwise
11:9 | Ignored
31:12 | Physical address of the 4-KByte page referenced by this entry
NOTES:
1. See Section 4.1.4 for how to determine whether the PAT is supported.
Table 4-7. Use of CR3 with PAE Paging

Bit Position(s) | Contents
4:0 | Ignored
31:5 | Physical address of the 32-Byte aligned page-directory-pointer table used for linear-address translation
63:32 | Ignored (these bits exist only on processors supporting the Intel-64 architecture)
The page-directory-pointer table comprises four (4) 64-bit entries called PDPTEs. Each PDPTE controls access to a
1-GByte region of the linear-address space. Corresponding to the PDPTEs, the logical processor maintains a set of
four (4) internal, non-architectural PDPTE registers, called PDPTE0, PDPTE1, PDPTE2, and PDPTE3. The logical
processor loads these registers from the PDPTEs in memory as part of certain operations:
• If PAE paging would be in use following an execution of MOV to CR0 or MOV to CR4 (see Section 4.1.1) and the
instruction is modifying any of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then the
PDPTEs are loaded from the address in CR3.
• If MOV to CR3 is executed while the logical processor is using PAE paging, the PDPTEs are loaded from the
address being loaded into CR3.
• If PAE paging is in use and a task switch changes the value of CR3, the PDPTEs are loaded from the address in
the new CR3 value.
• Certain VMX transitions load the PDPTE registers. See Section 4.11.1.
Table 4-8 gives the format of a PDPTE. If any of the PDPTEs sets both the P flag (bit 0) and any reserved bit, the
MOV to CR instruction causes a general-protection exception (#GP(0)) and the PDPTEs are not loaded.2 As shown
in Table 4-8, bits 2:1, 8:5, and 63:MAXPHYADDR are reserved in the PDPTEs.
1. If MAXPHYADDR < 52, bits in the range 51:MAXPHYADDR will be 0 in any physical address used by PAE paging. (The corresponding
bits are reserved in the paging-structure entries.) See Section 4.1.4 for how to determine MAXPHYADDR.
2. On some processors, reserved bits are checked even in PDPTEs in which the P flag (bit 0) is 0.
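The reserved-bit check described above can be sketched as a mask test. This models only the architectural case in which a PDPTE is checked when its P flag is 1 (as the footnote notes, some processors check reserved bits even when P = 0):

```python
def pdpte_reserved_mask(maxphyaddr):
    """Reserved bits in a PAE PDPTE: 2:1, 8:5, and 63:MAXPHYADDR."""
    mask = 0b110 | (0b1111 << 5)                        # bits 2:1 and 8:5
    mask |= ((1 << 64) - 1) & ~((1 << maxphyaddr) - 1)  # bits 63:MAXPHYADDR
    return mask

def pdpte_load_faults(pdpte, maxphyaddr=36):
    """True if loading this PDPTE would raise #GP(0):
    P flag (bit 0) set together with any reserved bit."""
    return bool(pdpte & 1) and bool(pdpte & pdpte_reserved_mask(maxphyaddr))

assert not pdpte_load_faults(0x1001)   # P=1, address at bit 12: clean
assert pdpte_load_faults(0x1003)       # P=1 with reserved bit 1 set
assert not pdpte_load_faults(0x0002)   # P=0: not checked in this model
```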
Bit Position(s) | Contents
3 (PWT) | Page-level write-through; indirectly determines the memory type used to access the page directory referenced by this entry (see Section 4.9)
4 (PCD) | Page-level cache disable; indirectly determines the memory type used to access the page directory referenced by this entry (see Section 4.9)
11:9 | Ignored
(M–1):12 | Physical address of 4-KByte aligned page directory referenced by this entry (Note 1)
NOTES:
1. M is an abbreviation for MAXPHYADDR, which is at most 52; see Section 4.1.4.
1. With PAE paging, the processor does not use CR3 when translating a linear address (as it does in the other paging modes). It does
not access the PDPTEs in the page-directory-pointer table during linear-address translation.
Figure 4-5. Linear-Address Translation to a 4-KByte Page Using PAE Paging (the linear address is split into
Directory Pointer [31:30], Directory [29:21], Table [20:12], and Offset [11:0]; the selected PDPTE register
locates the page directory, the PDE locates the page table, and the PTE locates the 4-KByte page)
1. See Section 4.1.4 for how to determine whether the PAT is supported.
Figure 4-6. Linear-Address Translation to a 2-MByte Page Using PAE Paging (the linear address is split into
Directory Pointer [31:30], Directory [29:21], and Offset [20:0]; the selected PDPTE register locates the page
directory, and the PDE with PS = 1 locates the 2-MByte page)
Table 4-9. Format of a PAE Page-Directory Entry that Maps a 2-MByte Page
Bit Position(s) | Contents
1 (R/W) | Read/write; if 0, writes may not be allowed to the 2-MByte page referenced by this entry (see Section 4.6)
2 (U/S) | User/supervisor; if 0, user-mode accesses are not allowed to the 2-MByte page referenced by this entry (see Section 4.6)
3 (PWT) | Page-level write-through; indirectly determines the memory type used to access the 2-MByte page referenced by this entry (see Section 4.9)
4 (PCD) | Page-level cache disable; indirectly determines the memory type used to access the 2-MByte page referenced by this entry (see Section 4.9)
5 (A) | Accessed; indicates whether software has accessed the 2-MByte page referenced by this entry (see Section 4.8)
6 (D) | Dirty; indicates whether software has written to the 2-MByte page referenced by this entry (see Section 4.8)
7 (PS) | Page size; must be 1 (otherwise, this entry references a page table; see Table 4-10)
8 (G) | Global; if CR4.PGE = 1, determines whether the translation is global (see Section 4.10); ignored otherwise
11:9 | Ignored
12 (PAT) | If the PAT is supported, indirectly determines the memory type used to access the 2-MByte page referenced by this entry (see Section 4.9.2); otherwise, reserved (must be 0) (Note 1)
63 (XD) | If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 2-MByte page controlled by this entry; see Section 4.6); otherwise, reserved (must be 0)
NOTES:
1. See Section 4.1.4 for how to determine whether the PAT is supported.
Table 4-10. Format of a PAE Page-Directory Entry that References a Page Table
Bit Position(s)    Contents
1 (R/W)            Read/write; if 0, writes may not be allowed to the 2-MByte region controlled by this entry (see Section 4.6)
2 (U/S)            User/supervisor; if 0, user-mode accesses are not allowed to the 2-MByte region controlled by this entry (see Section 4.6)
3 (PWT)            Page-level write-through; indirectly determines the memory type used to access the page table referenced by this entry (see Section 4.9)
4 (PCD)            Page-level cache disable; indirectly determines the memory type used to access the page table referenced by this entry (see Section 4.9)
5 (A)              Accessed; indicates whether this entry has been used for linear-address translation (see Section 4.8)
6                  Ignored
7 (PS)             Page size; must be 0 (otherwise, this entry maps a 2-MByte page; see Table 4-9)
11:8               Ignored
(M–1):12           Physical address of 4-KByte aligned page table referenced by this entry
63 (XD)            If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 2-MByte region controlled by this entry; see Section 4.6); otherwise, reserved (must be 0)
Table 4-11. Format of a PAE Page-Table Entry that Maps a 4-KByte Page
Bit Position(s)    Contents
1 (R/W)            Read/write; if 0, writes may not be allowed to the 4-KByte page referenced by this entry (see Section 4.6)
2 (U/S)            User/supervisor; if 0, user-mode accesses are not allowed to the 4-KByte page referenced by this entry (see Section 4.6)
3 (PWT)            Page-level write-through; indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 4.9)
4 (PCD)            Page-level cache disable; indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 4.9)
5 (A)              Accessed; indicates whether software has accessed the 4-KByte page referenced by this entry (see Section 4.8)
6 (D)              Dirty; indicates whether software has written to the 4-KByte page referenced by this entry (see Section 4.8)
7 (PAT)            If the PAT is supported, indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 4.9.2); otherwise, reserved (must be 0)1
8 (G)              Global; if CR4.PGE = 1, determines whether the translation is global (see Section 4.10); ignored otherwise
11:9               Ignored
63 (XD)            If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 4-KByte page controlled by this entry; see Section 4.6); otherwise, reserved (must be 0)
NOTES:
1. See Section 4.1.4 for how to determine whether the PAT is supported.
Figure 4-7 gives a summary of the formats of CR3 and the paging-structure entries with PAE paging. For the paging
structure entries, it identifies separately the format of entries that map pages, those that reference other paging
structures, and those that do neither because they are “not present”; bit 0 (P) and bit 7 (PS) are highlighted
because they determine how a paging-structure entry is used.
[Figure: diagram omitted; it shows the bit-field layouts of the PDPTE (present and not-present forms), the PDE (2-MByte-page, page-table, and not-present forms), and the PTE (present and not-present forms) with PAE paging; the field definitions match Tables 4-8 through 4-11]
Figure 4-7. Formats of CR3 and Paging-Structure Entries with PAE Paging
NOTES:
1. M is an abbreviation for MAXPHYADDR.
2. CR3 has 64 bits only on processors supporting the Intel-64 architecture. These bits are ignored with PAE paging.
3. Reserved fields must be 0.
4. If IA32_EFER.NXE = 0 and the P flag of a PDE or a PTE is 1, the XD flag (bit 63) is reserved.
Table 4-12. Use of CR3 with 4-Level Paging and CR4.PCIDE = 0
Bit Position(s)    Contents
2:0                Ignored
3 (PWT)            Page-level write-through; indirectly determines the memory type used to access the PML4 table during linear-address translation (see Section 4.9.2)
4 (PCD)            Page-level cache disable; indirectly determines the memory type used to access the PML4 table during linear-address translation (see Section 4.9.2)
11:5               Ignored
M–1:12             Physical address of the 4-KByte aligned PML4 table used for linear-address translation1
NOTES:
1. M is an abbreviation for MAXPHYADDR, which is at most 52; see Section 4.1.4.
• Table 4-13 illustrates how CR3 is used with 4-level paging if CR4.PCIDE = 1.
Table 4-13. Use of CR3 with 4-Level Paging and CR4.PCIDE = 1
Bit Position(s)    Contents
M–1:12             Physical address of the 4-KByte aligned PML4 table used for linear-address translation2
NOTES:
1. Section 4.9.2 explains how the processor determines the memory type used to access the PML4 table during linear-address translation with CR4.PCIDE = 1.
2. M is an abbreviation for MAXPHYADDR, which is at most 52; see Section 4.1.4.
3. See Section 4.10.4.1 for use of bit 63 of the source operand of the MOV to CR3 instruction.
After software modifies the value of CR4.PCIDE, the logical processor immediately begins using CR3 as specified for the new value. For example, if software changes CR4.PCIDE from 1 to 0, the current PCID immediately changes from CR3[11:0] to 000H (see also Section 4.10.4.1). In addition, the logical processor subsequently determines the memory type used to access the PML4 table using CR3.PWT and CR3.PCD, which had been bits 4:3 of the PCID.
1. If MAXPHYADDR < 52, bits in the range 51:MAXPHYADDR will be 0 in any physical address used by 4-level paging. (The corresponding bits are reserved in the paging-structure entries.) See Section 4.1.4 for how to determine MAXPHYADDR.
4-level paging may map linear addresses to 4-KByte pages, 2-MByte pages, or 1-GByte pages.1 Figure 4-8 illus-
trates the translation process when it produces a 4-KByte page; Figure 4-9 covers the case of a 2-MByte page, and
Figure 4-10 the case of a 1-GByte page.
[Figure: diagram omitted; bits 47:39 of the linear address select a PML4E, bits 38:30 select a PDPTE in the page-directory-pointer table, bits 29:21 select a PDE (with PS = 0) in the page directory, bits 20:12 select a PTE in the page table, and bits 11:0 give the offset into the 4-KByte page]
Figure 4-8. Linear-Address Translation to a 4-KByte Page Using 4-Level Paging
[Figure: diagram omitted; bits 47:39 of the linear address select a PML4E, bits 38:30 select a PDPTE in the page-directory-pointer table, bits 29:21 select a PDE (with PS = 1) in the page directory, and bits 20:0 give the offset into the 2-MByte page]
Figure 4-9. Linear-Address Translation to a 2-MByte Page Using 4-Level Paging
[Figure: diagram omitted; bits 47:39 of the linear address select a PML4E, bits 38:30 select a PDPTE (with PS = 1) in the page-directory-pointer table, and bits 29:0 give the offset into the 1-GByte page]
Figure 4-10. Linear-Address Translation to a 1-GByte Page Using 4-Level Paging
If CR4.PKE = 1, 4-level paging associates with each linear address a protection key. Section 4.6 explains how the
processor uses the protection key in its determination of the access rights of each linear address.
The following items describe the 4-level paging process in more detail, as well as how the page size and protection key are determined.
• A 4-KByte naturally aligned PML4 table is located at the physical address specified in bits 51:12 of CR3 (see
Table 4-12). A PML4 table comprises 512 64-bit entries (PML4Es). A PML4E is selected using the physical
address defined as follows:
— Bits 51:12 are from CR3.
— Bits 11:3 are bits 47:39 of the linear address.
— Bits 2:0 are all 0.
Because a PML4E is identified using bits 47:39 of the linear address, it controls access to a 512-GByte region of
the linear-address space.
• A 4-KByte naturally aligned page-directory-pointer table is located at the physical address specified in
bits 51:12 of the PML4E (see Table 4-14). A page-directory-pointer table comprises 512 64-bit entries
(PDPTEs). A PDPTE is selected using the physical address defined as follows:
— Bits 51:12 are from the PML4E.
— Bits 11:3 are bits 38:30 of the linear address.
— Bits 2:0 are all 0.
Because a PDPTE is identified using bits 47:30 of the linear address, it controls access to a 1-GByte region of the
linear-address space. Use of the PDPTE depends on its PS flag (bit 7):1
• If the PDPTE’s PS flag is 1, the PDPTE maps a 1-GByte page (see Table 4-15). The final physical address is
computed as follows:
— Bits 51:30 are from the PDPTE.
— Bits 29:0 are from the original linear address.
If CR4.PKE = 1, the linear address’s protection key is the value of bits 62:59 of the PDPTE.
• If the PDPTE’s PS flag is 0, a 4-KByte naturally aligned page directory is located at the physical address
specified in bits 51:12 of the PDPTE (see Table 4-16). A page directory comprises 512 64-bit entries (PDEs). A
PDE is selected using the physical address defined as follows:
— Bits 51:12 are from the PDPTE.
— Bits 11:3 are bits 29:21 of the linear address.
— Bits 2:0 are all 0.
Because a PDE is identified using bits 47:21 of the linear address, it controls access to a 2-MByte region of the
linear-address space. Use of the PDE depends on its PS flag:
• If the PDE's PS flag is 1, the PDE maps a 2-MByte page (see Table 4-17). The final physical address is computed
as follows:
— Bits 51:21 are from the PDE.
— Bits 20:0 are from the original linear address.
If CR4.PKE = 1, the linear address’s protection key is the value of bits 62:59 of the PDE.
• If the PDE’s PS flag is 0, a 4-KByte naturally aligned page table is located at the physical address specified in
bits 51:12 of the PDE (see Table 4-18). A page table comprises 512 64-bit entries (PTEs). A PTE is selected
using the physical address defined as follows:
— Bits 51:12 are from the PDE.
— Bits 11:3 are bits 20:12 of the linear address.
— Bits 2:0 are all 0.
1. The PS flag of a PDPTE is reserved and must be 0 (if the P flag is 1) if 1-GByte pages are not supported. See Section 4.1.4 for how
to determine whether 1-GByte pages are supported.
• Because a PTE is identified using bits 47:12 of the linear address, every PTE maps a 4-KByte page (see
Table 4-19). The final physical address is computed as follows:
— Bits 51:12 are from the PTE.
— Bits 11:0 are from the original linear address.
If CR4.PKE = 1, the linear address’s protection key is the value of bits 62:59 of the PTE.
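The entry-address computations in the items above can be condensed into a short sketch. The helper names are illustrative; the bit manipulations follow the rules just listed (bits 51:12 from CR3 or the parent entry, a 9-bit index from the linear address placed in bits 11:3, and bits 2:0 zero).

```c
#include <stdint.h>

#define ADDR_51_12 0x000FFFFFFFFFF000ULL   /* bits 51:12 of an entry */

/* Physical address of the paging-structure entry at a given level.
   'base' is CR3 or the parent entry; 'shift' is 39 (PML4E), 30
   (PDPTE), 21 (PDE), or 12 (PTE). */
static uint64_t entry_addr(uint64_t base, uint64_t linear, int shift)
{
    return (base & ADDR_51_12) | (((linear >> shift) & 0x1FFULL) << 3);
}

/* Final physical address for a 4-KByte page: bits 51:12 from the
   PTE, bits 11:0 from the original linear address. */
static uint64_t final_addr_4k(uint64_t pte, uint64_t linear)
{
    return (pte & ADDR_51_12) | (linear & 0xFFFULL);
}
```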
If a paging-structure entry’s P flag (bit 0) is 0 or if the entry sets any reserved bit, the entry is used neither to refer-
ence another paging-structure entry nor to map a page. There is no translation for a linear address whose transla-
tion would use such a paging-structure entry; a reference to such a linear address causes a page-fault exception
(see Section 4.7).
The following bits are reserved with 4-level paging:
• If the P flag of a paging-structure entry is 1, bits 51:MAXPHYADDR are reserved.
• If the P flag of a PML4E is 1, the PS flag is reserved.
• If 1-GByte pages are not supported and the P flag of a PDPTE is 1, the PS flag is reserved.1
• If the P flag and the PS flag of a PDPTE are both 1, bits 29:13 are reserved.
• If the P flag and the PS flag of a PDE are both 1, bits 20:13 are reserved.
• If IA32_EFER.NXE = 0 and the P flag of a paging-structure entry is 1, the XD flag (bit 63) is reserved.
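As a software model, the reserved-bit rules above can be checked for a present PDE as follows. The function name is illustrative; MAXPHYADDR and IA32_EFER.NXE are assumptions supplied by the caller (on real hardware they come from CPUID and the IA32_EFER MSR).

```c
#include <stdint.h>

/* Reserved-bit check for a present 4-level PDE, per the list above. */
static int pde_rsvd_ok(uint64_t pde, unsigned maxphyaddr, int nxe)
{
    /* Bits 51:MAXPHYADDR are reserved in any present entry. */
    uint64_t rsvd = ((1ULL << 52) - 1) & ~((1ULL << maxphyaddr) - 1);
    if (pde & (1ULL << 7))      /* PS = 1: entry maps a 2-MByte page */
        rsvd |= 0x1FE000ULL;    /* bits 20:13 reserved               */
    if (!nxe)
        rsvd |= 1ULL << 63;     /* XD (bit 63) reserved if NXE = 0   */
    return (pde & rsvd) == 0;
}
```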
A reference using a linear address that is successfully translated to a physical address is performed only if allowed
by the access rights of the translation; see Section 4.6.
Figure 4-11 gives a summary of the formats of CR3 and the 4-level paging-structure entries. For the paging struc-
ture entries, it identifies separately the format of entries that map pages, those that reference other paging struc-
tures, and those that do neither because they are “not present”; bit 0 (P) and bit 7 (PS) are highlighted because
they determine how a paging-structure entry is used.
Table 4-14. Format of a 4-Level PML4 Entry (PML4E) that References a Page-Directory-Pointer Table
Bit Position(s)    Contents
1 (R/W)            Read/write; if 0, writes may not be allowed to the 512-GByte region controlled by this entry (see Section 4.6)
2 (U/S)            User/supervisor; if 0, user-mode accesses are not allowed to the 512-GByte region controlled by this entry (see Section 4.6)
3 (PWT)            Page-level write-through; indirectly determines the memory type used to access the page-directory-pointer table referenced by this entry (see Section 4.9.2)
4 (PCD)            Page-level cache disable; indirectly determines the memory type used to access the page-directory-pointer table referenced by this entry (see Section 4.9.2)
5 (A)              Accessed; indicates whether this entry has been used for linear-address translation (see Section 4.8)
6                  Ignored
11:8               Ignored
1. See Section 4.1.4 for how to determine whether 1-GByte pages are supported.
M–1:12             Physical address of 4-KByte aligned page-directory-pointer table referenced by this entry
62:52              Ignored
63 (XD)            If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 512-GByte region controlled by this entry; see Section 4.6); otherwise, reserved (must be 0)
Table 4-15. Format of a 4-Level Page-Directory-Pointer-Table Entry (PDPTE) that Maps a 1-GByte Page
Bit Position(s)    Contents
1 (R/W)            Read/write; if 0, writes may not be allowed to the 1-GByte page referenced by this entry (see Section 4.6)
2 (U/S)            User/supervisor; if 0, user-mode accesses are not allowed to the 1-GByte page referenced by this entry (see Section 4.6)
3 (PWT)            Page-level write-through; indirectly determines the memory type used to access the 1-GByte page referenced by this entry (see Section 4.9.2)
4 (PCD)            Page-level cache disable; indirectly determines the memory type used to access the 1-GByte page referenced by this entry (see Section 4.9.2)
5 (A)              Accessed; indicates whether software has accessed the 1-GByte page referenced by this entry (see Section 4.8)
6 (D)              Dirty; indicates whether software has written to the 1-GByte page referenced by this entry (see Section 4.8)
7 (PS)             Page size; must be 1 (otherwise, this entry references a page directory; see Table 4-16)
8 (G)              Global; if CR4.PGE = 1, determines whether the translation is global (see Section 4.10); ignored otherwise
11:9               Ignored
12 (PAT)           Indirectly determines the memory type used to access the 1-GByte page referenced by this entry (see Section 4.9.2)1
58:52              Ignored
62:59              Protection key; if CR4.PKE = 1, determines the protection key of the page (see Section 4.6.2); ignored otherwise
63 (XD)            If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 1-GByte page controlled by this entry; see Section 4.6); otherwise, reserved (must be 0)
NOTES:
1. The PAT is supported on all processors that support 4-level paging.
Table 4-16. Format of a 4-Level Page-Directory-Pointer-Table Entry (PDPTE) that References a Page Directory
Bit Position(s)    Contents
1 (R/W)            Read/write; if 0, writes may not be allowed to the 1-GByte region controlled by this entry (see Section 4.6)
2 (U/S)            User/supervisor; if 0, user-mode accesses are not allowed to the 1-GByte region controlled by this entry (see Section 4.6)
3 (PWT)            Page-level write-through; indirectly determines the memory type used to access the page directory referenced by this entry (see Section 4.9.2)
4 (PCD)            Page-level cache disable; indirectly determines the memory type used to access the page directory referenced by this entry (see Section 4.9.2)
5 (A)              Accessed; indicates whether this entry has been used for linear-address translation (see Section 4.8)
6                  Ignored
7 (PS)             Page size; must be 0 (otherwise, this entry maps a 1-GByte page; see Table 4-15)
11:8               Ignored
(M–1):12           Physical address of 4-KByte aligned page directory referenced by this entry
62:52              Ignored
63 (XD)            If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 1-GByte region controlled by this entry; see Section 4.6); otherwise, reserved (must be 0)
Table 4-17. Format of a 4-Level Page-Directory Entry that Maps a 2-MByte Page
Bit Position(s)    Contents
1 (R/W)            Read/write; if 0, writes may not be allowed to the 2-MByte page referenced by this entry (see Section 4.6)
2 (U/S)            User/supervisor; if 0, user-mode accesses are not allowed to the 2-MByte page referenced by this entry (see Section 4.6)
3 (PWT)            Page-level write-through; indirectly determines the memory type used to access the 2-MByte page referenced by this entry (see Section 4.9.2)
4 (PCD)            Page-level cache disable; indirectly determines the memory type used to access the 2-MByte page referenced by this entry (see Section 4.9.2)
5 (A)              Accessed; indicates whether software has accessed the 2-MByte page referenced by this entry (see Section 4.8)
6 (D)              Dirty; indicates whether software has written to the 2-MByte page referenced by this entry (see Section 4.8)
7 (PS)             Page size; must be 1 (otherwise, this entry references a page table; see Table 4-18)
8 (G)              Global; if CR4.PGE = 1, determines whether the translation is global (see Section 4.10); ignored otherwise
11:9               Ignored
12 (PAT)           Indirectly determines the memory type used to access the 2-MByte page referenced by this entry (see Section 4.9.2)
58:52              Ignored
62:59              Protection key; if CR4.PKE = 1, determines the protection key of the page (see Section 4.6.2); ignored otherwise
63 (XD)            If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 2-MByte page controlled by this entry; see Section 4.6); otherwise, reserved (must be 0)
Table 4-18. Format of a 4-Level Page-Directory Entry that References a Page Table
Bit Position(s)    Contents
1 (R/W)            Read/write; if 0, writes may not be allowed to the 2-MByte region controlled by this entry (see Section 4.6)
2 (U/S)            User/supervisor; if 0, user-mode accesses are not allowed to the 2-MByte region controlled by this entry (see Section 4.6)
3 (PWT)            Page-level write-through; indirectly determines the memory type used to access the page table referenced by this entry (see Section 4.9.2)
4 (PCD)            Page-level cache disable; indirectly determines the memory type used to access the page table referenced by this entry (see Section 4.9.2)
5 (A)              Accessed; indicates whether this entry has been used for linear-address translation (see Section 4.8)
6                  Ignored
7 (PS)             Page size; must be 0 (otherwise, this entry maps a 2-MByte page; see Table 4-17)
11:8               Ignored
(M–1):12           Physical address of 4-KByte aligned page table referenced by this entry
62:52              Ignored
63 (XD)            If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 2-MByte region controlled by this entry; see Section 4.6); otherwise, reserved (must be 0)
Table 4-19. Format of a 4-Level Page-Table Entry that Maps a 4-KByte Page
Bit Position(s)    Contents
1 (R/W)            Read/write; if 0, writes may not be allowed to the 4-KByte page referenced by this entry (see Section 4.6)
2 (U/S)            User/supervisor; if 0, user-mode accesses are not allowed to the 4-KByte page referenced by this entry (see Section 4.6)
3 (PWT)            Page-level write-through; indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 4.9.2)
4 (PCD)            Page-level cache disable; indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 4.9.2)
5 (A)              Accessed; indicates whether software has accessed the 4-KByte page referenced by this entry (see Section 4.8)
6 (D)              Dirty; indicates whether software has written to the 4-KByte page referenced by this entry (see Section 4.8)
7 (PAT)            Indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 4.9.2)
8 (G)              Global; if CR4.PGE = 1, determines whether the translation is global (see Section 4.10); ignored otherwise
11:9               Ignored
58:52              Ignored
62:59              Protection key; if CR4.PKE = 1, determines the protection key of the page (see Section 4.6.2); ignored otherwise
63 (XD)            If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 4-KByte page controlled by this entry; see Section 4.6); otherwise, reserved (must be 0)
[Figure: diagram omitted; it shows the bit-field layouts of CR3, the PML4E (present and not-present forms), the PDPTE (1-GByte-page, page-directory, and not-present forms), the PDE (2-MByte-page, page-table, and not-present forms), and the PTE (present and not-present forms) with 4-level paging; the field definitions match Tables 4-12 through 4-19]
Figure 4-11. Formats of CR3 and Paging-Structure Entries with 4-Level Paging
NOTES:
1. M is an abbreviation for MAXPHYADDR.
2. Reserved fields must be 0.
3. If IA32_EFER.NXE = 0 and the P flag of a paging-structure entry is 1, the XD flag (bit 63) is reserved.
4. If CR4.PKE = 0, the protection key is ignored.
Section 4.6.1 describes how the processor determines the access rights for each linear address. Section 4.6.2
provides additional information about how protection keys contribute to access-rights determination. (They do so
only with 4-level paging and only if CR4.PKE = 1.)
How a linear address’s protection key controls access to the address depends on the mode of the linear address:
• A linear address’s protection key controls only data accesses to the address. It does not in any way affect instruction fetches from the address.
• The protection key of a supervisor-mode address is ignored and does not control data accesses to the address.
Because of this, Section 4.6.1 does not refer to protection keys when specifying the access rights for
supervisor-mode addresses.
• Use of the protection key i of a user-mode address depends on the value of the PKRU register:
— If ADi = 1, no data accesses are permitted.
— If WDi = 1, permission may be denied to certain data write accesses:
• User-mode write accesses are not permitted.
• Supervisor-mode write accesses are not permitted if CR0.WP = 1. (If CR0.WP = 0, WDi does not affect
supervisor-mode write accesses to user-mode addresses with protection key i.)
Protection keys apply to shadow-stack accesses just as they do to ordinary data accesses.
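The PKRU checks described above can be modeled for user-mode data accesses as follows. The function name is illustrative; in the PKRU register, ADi is bit 2i and WDi is bit 2i+1 for protection key i. (Supervisor-mode writes to user-mode addresses additionally depend on CR0.WP, as noted above, and are not modeled here.)

```c
#include <stdint.h>

/* Returns nonzero if a user-mode data access of the given kind is
   permitted to a user-mode address with protection key 'key'. */
static int pkru_allows_user_access(uint32_t pkru, unsigned key, int is_write)
{
    unsigned ad = (pkru >> (2 * key)) & 1;      /* AD_i */
    unsigned wd = (pkru >> (2 * key + 1)) & 1;  /* WD_i */
    if (ad)
        return 0;           /* AD_i = 1: no data accesses permitted */
    if (is_write && wd)
        return 0;           /* WD_i = 1: user-mode writes denied    */
    return 1;
}
```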
Figure 4-12 illustrates the error code that the processor provides on delivery of a page-fault exception. The
following items explain how the bits in the error code describe the nature of the page-fault exception:
[Figure: diagram omitted; the error code is a 32-bit value with bit 0 = P, bit 1 = W/R, bit 2 = U/S, bit 3 = RSVD, bit 4 = I/D, bit 5 = PK, bit 6 = SS, and bit 15 = SGX; the remaining bits are reserved]
Figure 4-12. Page-Fault Error Code
1. Some past processors had errata for some page faults that occur when there is no translation for the linear address because the P flag was 0 in one of the paging-structure entries used to translate that address. Due to these errata, some such page faults produced error codes that cleared bit 0 (P flag) and set bit 3 (RSVD flag).
NOTE
The accesses used by the processor to set these flags may or may not be exposed to the
processor’s self-modifying code detection logic. If the processor is executing code from the same
memory area that is being used for the paging structures, the setting of these flags may or may not
result in an immediate change to the executing code stream.
1. With PAE paging, the PDPTEs are not used during linear-address translation but only to load the PDPTE registers for some executions of the MOV CR instruction (see Section 4.4.1). For this reason, the PDPTEs do not contain accessed flags with PAE paging.
4.9.1 Paging and Memory Typing When the PAT is Not Supported (Pentium Pro and Pentium II Processors)
NOTE
The PAT is supported on all processors that support 4-level paging. Thus, this section applies only
to 32-bit paging and PAE paging.
If the PAT is not supported, paging contributes to memory typing in conjunction with the memory-type range regis-
ters (MTRRs) as specified in Table 11-6 in Section 11.5.2.1.
For any access to a physical address, the table combines the memory type specified for that physical address by
the MTRRs with a PCD value and a PWT value. The latter two values are determined as follows:
• For an access to a PDE with 32-bit paging, the PCD and PWT values come from CR3.
• For an access to a PDE with PAE paging, the PCD and PWT values come from the relevant PDPTE register.
• For an access to a PTE, the PCD and PWT values come from the relevant PDE.
• For an access to the physical address that is the translation of a linear address, the PCD and PWT values come
from the relevant PTE (if the translation uses a 4-KByte page) or the relevant PDE (otherwise).
• With PAE paging, the UC memory type is used when loading the PDPTEs (see Section 4.4.1).
4.9.2 Paging and Memory Typing When the PAT is Supported (Pentium III and More Recent Processor Families)
If the PAT is supported, paging contributes to memory typing in conjunction with the PAT and the memory-type
range registers (MTRRs) as specified in Table 11-7 in Section 11.5.2.2.
The PAT is a 64-bit MSR (IA32_PAT; MSR index 277H) comprising eight (8) 8-bit entries (entry i comprises
bits 8i+7:8i of the MSR).
For any access to a physical address, the table combines the memory type specified for that physical address by
the MTRRs with a memory type selected from the PAT. Table 11-11 in Section 11.12.3 specifies how a memory type
is selected from the PAT. Specifically, it comes from entry i of the PAT, where i is defined as follows:
• For an access to an entry in a paging structure whose address is in CR3 (e.g., the PML4 table with 4-level
paging):
— For 4-level paging with CR4.PCIDE = 1, i = 0.
— Otherwise, i = 2*PCD+PWT, where the PCD and PWT values come from CR3.
• For an access to a PDE with PAE paging, i = 2*PCD+PWT, where the PCD and PWT values come from the
relevant PDPTE register.
• For an access to a paging-structure entry X whose address is in another paging-structure entry Y, i =
2*PCD+PWT, where the PCD and PWT values come from Y.
1. The PAT is supported on Pentium III and more recent processor families. See Section 4.1.4 for how to determine whether the PAT is
supported.
• For an access to the physical address that is the translation of a linear address, i = 4*PAT+2*PCD+PWT, where
the PAT, PCD, and PWT values come from the relevant PTE (if the translation uses a 4-KByte page), the relevant
PDE (if the translation uses a 2-MByte page or a 4-MByte page), or the relevant PDPTE (if the translation uses
a 1-GByte page).
• With PAE paging, the WB memory type is used when loading the PDPTEs (see Section 4.4.1).1
1. Some older IA-32 processors used the UC memory type when loading the PDPTEs. Some processors may use the UC memory type if
CR0.CD = 1 or if the MTRRs are disabled. These behaviors are model-specific and not architectural.
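The index computation above can be sketched as follows. The function name is illustrative, and the IA32_PAT value used in the tests is the processor's power-on default (0007040600070406H); treating it as the current value is an assumption about configuration, not something the code reads from hardware.

```c
#include <stdint.h>

/* Select a memory type from IA32_PAT: entry i occupies bits
   8i+7:8i of the MSR, with i = 4*PAT + 2*PCD + PWT for the final
   translation of a linear address. */
static uint8_t pat_memory_type(uint64_t ia32_pat, int pat, int pcd, int pwt)
{
    int i = 4 * pat + 2 * pcd + pwt;    /* 0 <= i <= 7 */
    return (uint8_t)(ia32_pat >> (8 * i));
}
```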
If CR4.PCIDE = 0, a logical processor does not cache information for any PCID other than 000H. This is because
(1) if CR4.PCIDE = 0, the logical processor will associate any newly cached information with the current PCID,
000H; and (2) if MOV to CR4 clears CR4.PCIDE, all cached information is invalidated (see Section 4.10.4.1).
NOTE
In revisions of this manual that were produced when no processors allowed CR4.PCIDE to be set to
1, Section 4.10 discussed the caching of translation information without any reference to PCIDs.
While the section now refers to PCIDs in its specification of this caching, this documentation change
is not intended to imply any change to the behavior of processors that do not allow CR4.PCIDE to
be set to 1.
• The access rights from the paging-structure entries used to translate linear addresses with the page number
(see Section 4.6):
— The logical-AND of the R/W flags.
— The logical-AND of the U/S flags.
— The logical-OR of the XD flags (necessary only if IA32_EFER.NXE = 1).
— The protection key (necessary only with 4-level paging and CR4.PKE = 1).
• Attributes from a paging-structure entry that identifies the final page frame for the page number (either a PTE
or a paging-structure entry in which the PS flag is 1):
— The dirty flag (see Section 4.8).
— The memory type (see Section 4.9).
(TLB entries may contain other information as well. A processor may implement multiple TLBs, and some of these
may be for special purposes, e.g., only for instruction fetches. Such special-purpose TLBs may not contain some of
this information if it is not necessary. For example, a TLB used only for instruction fetches need not contain infor-
mation about the R/W and dirty flags.)
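The AND/OR combining rules in the list above can be sketched as follows; the struct and function names are illustrative, and the bit positions (R/W = bit 1, U/S = bit 2, XD = bit 63) are those of the paging-structure entries.

```c
#include <stdint.h>

struct rights { int rw, us, xd; };

/* Combine access rights across the paging-structure entries used by
   a translation: R/W and U/S are ANDed, XD is ORed. */
static struct rights combine_rights(const uint64_t *entries, int n)
{
    struct rights r = { 1, 1, 0 };
    for (int i = 0; i < n; i++) {
        r.rw &= (int)((entries[i] >> 1) & 1);
        r.us &= (int)((entries[i] >> 2) & 1);
        r.xd |= (int)((entries[i] >> 63) & 1);
    }
    return r;
}
```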
As noted in Section 4.10.1, any TLB entries created by a logical processor are associated with the current PCID.
Processors need not implement any TLBs. Processors that do implement TLBs may invalidate any TLB entry at any
time. Software should not rely on the existence of TLBs or on the retention of TLB entries.
Because the G flag is used only in paging-structure entries that map a page, and because information from such
entries is not cached in the paging-structure caches, the global-page feature does not affect the behavior of the
paging-structure caches.
A logical processor may use a global TLB entry to translate a linear address, even if the TLB entry is associated with
a PCID different from the current PCID.
1. With PAE paging, the PDPTEs are stored in internal, non-architectural registers. The operation of these registers is described in Section 4.4.1 and differs from that described here.
— The processor does not create a PDPTE-cache entry unless the P flag is 1, the PS flag is 0, and the reserved
bits are 0 in the PML4E and the PDPTE in memory.
— The processor does not create a PDPTE-cache entry unless the accessed flags are 1 in the PML4E and the
PDPTE in memory; before caching a translation, the processor sets any accessed flags that are not already
1.
— The processor may create a PDPTE-cache entry even if there are no translations for any linear address that
might use that entry.
— If the processor creates a PDPTE-cache entry, the processor may retain it unmodified even if software
subsequently modifies the corresponding PML4E or PDPTE in memory.
• PDE cache. The use of the PDE cache depends on the paging mode:
— For 32-bit paging, each PDE-cache entry is referenced by a 10-bit value and is used for linear addresses for
which bits 31:22 have that value.
— For PAE paging, each PDE-cache entry is referenced by an 11-bit value and is used for linear addresses for
which bits 31:21 have that value.
— For 4-level paging, each PDE-cache entry is referenced by a 27-bit value and is used for linear addresses for
which bits 47:21 have that value.
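As a concrete illustration of these index widths, the tag extraction for each paging mode can be sketched in C (the helper names are invented for this sketch):

```c
#include <stdint.h>

/* Tag used to reference a PDE-cache entry for a given linear address.
 * These helpers are illustrative, not an architectural interface. */
uint32_t pde_tag_32bit(uint32_t la)  { return la >> 22; }  /* bits 31:22, 10 bits */
uint32_t pde_tag_pae(uint32_t la)    { return la >> 21; }  /* bits 31:21, 11 bits */
uint32_t pde_tag_4level(uint64_t la)                       /* bits 47:21, 27 bits */
{
    return (uint32_t)((la >> 21) & ((1ULL << 27) - 1));
}
```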
A PDE-cache entry contains information from the PML4E, PDPTE, and PDE used to translate the relevant linear
addresses (for 32-bit paging and PAE paging, only the PDE applies):
— The physical address from the PDE (the address of the page table). (No PDE-cache entry is created for a
PDE that maps a page.)
— The logical-AND of the R/W flags in the PML4E, PDPTE, and PDE.
— The logical-AND of the U/S flags in the PML4E, PDPTE, and PDE.
— The logical-OR of the XD flags in the PML4E, PDPTE, and PDE.
— The values of the PCD and PWT flags of the PDE.
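The AND/OR combining rules above can be sketched for the 4-level case as follows (the bit positions are the standard paging-structure flag positions; the function name is invented):

```c
#include <stdint.h>

#define PG_RW   (1ULL << 1)    /* writable            */
#define PG_US   (1ULL << 2)    /* user-accessible     */
#define PG_PWT  (1ULL << 3)    /* page write-through  */
#define PG_PCD  (1ULL << 4)    /* page cache disable  */
#define PG_XD   (1ULL << 63)   /* execute disable     */

/* Effective rights cached in a PDE-cache entry: R/W and U/S are the
 * logical-AND across the walk; XD is the logical-OR; PCD and PWT come
 * from the PDE alone. */
uint64_t pde_cache_rights(uint64_t pml4e, uint64_t pdpte, uint64_t pde)
{
    uint64_t e = 0;
    e |= pml4e & pdpte & pde & PG_RW;
    e |= pml4e & pdpte & pde & PG_US;
    e |= (pml4e | pdpte | pde) & PG_XD;
    e |= pde & (PG_PCD | PG_PWT);
    return e;
}
```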
The following items detail how a processor may use the PDE cache (references below to PML4Es and PDPTEs
apply only to 4-level paging):
— If the processor has a PDE-cache entry for a linear address, it may use that entry when translating the
linear address (instead of the PML4E, the PDPTE, and the PDE in memory).
— The processor does not create a PDE-cache entry unless the P flag is 1, the PS flag is 0, and the reserved
bits are 0 in the PML4E, the PDPTE, and the PDE in memory.
— The processor does not create a PDE-cache entry unless the accessed flag is 1 in the PML4E, the PDPTE, and
the PDE in memory; before caching a translation, the processor sets any accessed flags that are not already
1.
— The processor may create a PDE-cache entry even if there are no translations for any linear address that
might use that entry.
— If the processor creates a PDE-cache entry, the processor may retain it unmodified even if software subse-
quently modifies the corresponding PML4E, the PDPTE, or the PDE in memory.
Information from a paging-structure entry can be included in entries in the paging-structure caches for other
paging-structure entries referenced by the original entry. For example, if the R/W flag is 0 in a PML4E, then the R/W
flag will be 0 in any PDPTE-cache entry for a PDPTE from the page-directory-pointer table referenced by that
PML4E. This is because the R/W flag of each such PDPTE-cache entry is the logical-AND of the R/W flags in the
appropriate PML4E and PDPTE.
The paging-structure caches contain information only from paging-structure entries that reference other paging
structures (and not those that map pages). Because the G flag is not used in such paging-structure entries, the
global-page feature does not affect the behavior of the paging-structure caches.
The processor may create entries in paging-structure caches for translations required for prefetches and for
accesses that are a result of speculative execution that would never actually occur in the executed code path.
As noted in Section 4.10.1, any entries created in paging-structure caches by a logical processor are associated
with the current PCID.
A processor may or may not implement any of the paging-structure caches. Software should rely on neither their
presence nor their absence. The processor may invalidate entries in these caches at any time. Because the
processor may create the cache entries at the time of translation and not update them following subsequent modi-
fications to the paging structures in memory, software should take care to invalidate the cache entries appropri-
ately when causing such modifications. The invalidation of TLBs and the paging-structure caches is described in
Section 4.10.4.
— Any PDE-cache entry associated with linear addresses with 0 in bits 47:21 contains address X for similar
reasons.
— Any TLB entry for page number 0 (associated with linear addresses with 0 in bits 47:12) translates to page
frame X >> 12 for similar reasons.
The same PML4E contributes its address X to all these cache entries because the self-referencing nature of the
entry causes it to be used as a PML4E, a PDPTE, a PDE, and a PTE.
1. If the paging structures map the linear address using a page larger than 4 KBytes and there are multiple TLB entries for that page
(see Section 4.10.2.3), the instruction invalidates all of them.
2. If the paging structures map the linear address using a page larger than 4 KBytes and there are multiple TLB entries for that page
(see Section 4.10.2.3), the instruction invalidates all of them.
— If CR4.PCIDE = 1 and bit 63 of the instruction’s source operand is 1, the instruction is not required to
invalidate any TLB entries or entries in paging-structure caches.
• MOV to CR4. The behavior of the instruction depends on the bits being modified:
— The instruction invalidates all TLB entries (including global entries) and all entries in all paging-structure
caches (for all PCIDs) if (1) it changes the value of CR4.PGE;1 or (2) it changes the value of the CR4.PCIDE
from 1 to 0.
— The instruction invalidates all TLB entries and all entries in all paging-structure caches for the current PCID
if (1) it changes the value of CR4.PAE; or (2) it changes the value of CR4.SMEP from 0 to 1.
• Task switch. If a task switch changes the value of CR3, it invalidates all TLB entries associated with PCID 000H
except those for global pages. It also invalidates all entries in all paging-structure caches associated with PCID
000H.2
• VMX transitions. See Section 4.11.1.
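The MOV-to-CR4 rules above can be modeled as a small decision function (a sketch; the enum and function names are invented, and only the rules listed above are encoded):

```c
#include <stdint.h>

#define CR4_PAE   (1u << 5)
#define CR4_PGE   (1u << 7)
#define CR4_PCIDE (1u << 17)
#define CR4_SMEP  (1u << 20)

typedef enum { INVAL_NONE, INVAL_CURRENT_PCID, INVAL_ALL_PCIDS } cr4_inval_t;

/* Required invalidation on MOV to CR4, per the rules above: any change
 * to PGE, or clearing PCIDE, flushes everything (all PCIDs, including
 * global entries); any change to PAE, or setting SMEP, flushes the
 * current PCID. */
cr4_inval_t cr4_invalidation(uint32_t old_cr4, uint32_t new_cr4)
{
    uint32_t changed = old_cr4 ^ new_cr4;
    if ((changed & CR4_PGE) ||
        ((changed & CR4_PCIDE) && (old_cr4 & CR4_PCIDE)))
        return INVAL_ALL_PCIDS;
    if ((changed & CR4_PAE) ||
        ((changed & CR4_SMEP) && (new_cr4 & CR4_SMEP)))
        return INVAL_CURRENT_PCID;
    return INVAL_NONE;
}
```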
The processor is always free to invalidate additional entries in the TLBs and paging-structure caches. The following
are some examples:
• INVLPG may invalidate TLB entries for pages other than the one corresponding to its linear-address operand. It
may invalidate TLB entries and paging-structure-cache entries associated with PCIDs other than the current
PCID.
• INVPCID may invalidate TLB entries for pages other than the one corresponding to the specified linear address.
It may invalidate TLB entries and paging-structure-cache entries associated with PCIDs other than the
specified PCID.
• MOV to CR0 may invalidate TLB entries even if CR0.PG is not changing. For example, this may occur if either
CR0.CD or CR0.NW is modified.
• MOV to CR3 may invalidate TLB entries for global pages. If CR4.PCIDE = 1 and bit 63 of the instruction’s source
operand is 0, it may invalidate TLB entries and entries in the paging-structure caches associated with PCIDs
other than the PCID it is establishing. It may invalidate entries if CR4.PCIDE = 1 and bit 63 of the instruction’s
source operand is 1.
• MOV to CR4 may invalidate TLB entries when changing CR4.PSE or when changing CR4.SMEP from 1 to 0.
• On a processor supporting Hyper-Threading Technology, invalidations performed on one logical processor may
invalidate entries in the TLBs and paging-structure caches used by other logical processors.
(Other instructions and operations may invalidate entries in the TLBs and the paging-structure caches, but the
instructions identified above are recommended.)
In addition to the instructions identified above, page faults invalidate entries in the TLBs and paging-structure
caches. In particular, a page-fault exception resulting from an attempt to use a linear address will invalidate any
TLB entries that are for a page number corresponding to that linear address and that are associated with the
current PCID. It also invalidates all entries in the paging-structure caches that would be used for that linear address
and that are associated with the current PCID.3 These invalidations ensure that the page-fault exception will not
recur (if the faulting instruction is re-executed) if it would not be caused by the contents of the paging structures
in memory (and if, therefore, it resulted from cached entries that were not invalidated after the paging structures
were modified in memory).
As noted in Section 4.10.2, some processors may choose to cache multiple smaller-page TLB entries for a transla-
tion specified by the paging structures to use a page larger than 4 KBytes. There is no way for software to be aware
that multiple translations for smaller pages have been used for a large page. The INVLPG instruction and page
faults provide the same assurances that they provide when a single TLB entry is used: they invalidate all TLB
entries corresponding to the translation specified by the paging structures.
1. If CR4.PGE is changing from 0 to 1, there were no global TLB entries before the execution; if CR4.PGE is changing from 1 to 0, there
will be no global TLB entries after the execution.
2. Task switches do not occur in IA-32e mode and thus cannot occur with 4-level paging. Since CR4.PCIDE can be set only with 4-level
paging, task switches occur only with CR4.PCIDE = 0.
3. Unlike INVLPG, page faults need not invalidate all entries in the paging-structure caches, only those that would be used to translate
the faulting linear address.
1. One execution of INVLPG is sufficient even for a page with size greater than 4 KBytes.
1. If it is also the case that no invalidation was performed the last time the P flag was changed from 1 to 0, the processor may use a
TLB entry or paging-structure cache entry that was created when the P flag had earlier been 1.
• If a paging-structure entry is modified to change the R/W flag from 0 to 1, write accesses to linear addresses
whose translation is controlled by this entry may or may not cause a page-fault exception.
• If a paging-structure entry is modified to change the U/S flag from 0 to 1, user-mode accesses to linear
addresses whose translation is controlled by this entry may or may not cause a page-fault exception.
• If a paging-structure entry is modified to change the XD flag from 1 to 0, instruction fetches from linear
addresses whose translation is controlled by this entry may or may not cause a page-fault exception.
As noted in Section 8.1.1, an x87 instruction or an SSE instruction that accesses data larger than a quadword may
be implemented using multiple memory accesses. If such an instruction stores to memory and invalidation has
been delayed, some of the accesses may complete (writing to memory) while another causes a page-fault excep-
tion.1 In this case, the effects of the completed accesses may be visible to software even though the overall instruc-
tion caused a fault.
In some cases, the consequences of delayed invalidation may not affect software adversely. For example, when
freeing a portion of the linear-address space (by marking paging-structure entries “not present”), invalidation
using INVLPG may be delayed if software does not re-allocate that portion of the linear-address space or the
memory that had been associated with it. However, because of speculative execution (or errant software), there
may be accesses to the freed portion of the linear-address space before the invalidations occur. In this case, the
following can happen:
• Reads can occur to the freed portion of the linear-address space. Therefore, invalidation should not be delayed
for an address range that has read side effects.
• The processor may retain entries in the TLBs and paging-structure caches for an extended period of time.
Software should not assume that the processor will not use entries associated with a linear address simply
because time has passed.
• As noted in Section 4.10.3.1, the processor may create an entry in a paging-structure cache even if there are
no translations for any linear address that might use that entry. Thus, if software has marked “not present” all
entries in a page table, the processor may subsequently create a PDE-cache entry for the PDE that references
that page table (assuming that the PDE itself is marked “present”).
• If software attempts to write to the freed portion of the linear-address space, the processor might not generate
a page fault. (Such an attempt would likely be the result of a software error.) For that reason, the page frames
previously associated with the freed portion of the linear-address space should not be reallocated for another
purpose until the appropriate invalidations have been performed.
1. If the accesses are to different pages, this may occur even if invalidation has not been delayed.
• All logical processors that are using the paging structures that are being modified must participate and perform
appropriate invalidations after the modifications are made.
• If the modifications to the paging-structure entries are made before the barrier or if there is no barrier, the
operating system must ensure one of the following: (1) that the affected linear-address range is not used
between the time of modification and the time of invalidation; or (2) that it is prepared to deal with the conse-
quences of the affected linear-address range being used during that period. For example, if the operating
system does not allow pages being freed to be reallocated for another purpose until after the required invalida-
tions, writes to those pages by errant software will not unexpectedly modify memory that is in use.
• Software must be prepared to deal with reads, instruction fetches, and prefetch requests to the affected linear-
address range that are a result of speculative execution that would never actually occur in the executed code
path.
When multiple logical processors are using the same linear-address space at the same time, they must coordinate
before any request to modify the paging-structure entries that control that linear-address space. In these cases,
the barrier in the TLB shootdown routine may not be required. For example, when freeing a range of linear
addresses, some other mechanism can assure no logical processor is using that range before the request to free it
is made. In this case, a logical processor freeing the range can clear the P flags in the PTEs associated with the
range, free the physical page frames associated with the range, and then signal the other logical processors using
that linear-address space to perform the necessary invalidations. All the affected logical processors must complete
their invalidations before the linear-address range and the physical page frames previously associated with that
range can be reallocated.
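The coordination sequence just described — modify the paging-structure entries, signal the other logical processors, wait for all acknowledgments before reallocating — can be modeled in user space with C11 atomics and POSIX threads (all names are invented; local_invalidate() stands in for INVLPG/INVPCID):

```c
#include <pthread.h>
#include <stdatomic.h>

#define NCPUS 4                       /* one initiator + three responders */

static atomic_int shootdown_gen;      /* bumped by the initiator ("IPI")  */
static atomic_int acks;               /* responders acknowledge here      */
static atomic_int stop;

static void local_invalidate(void) { /* stand-in for INVLPG/INVPCID */ }

/* Each responder polls for a new shootdown generation, flushes its
 * (simulated) TLB, and acknowledges. */
static void *cpu_loop(void *arg)
{
    (void)arg;
    int seen = 0;
    while (!atomic_load(&stop)) {
        int gen = atomic_load(&shootdown_gen);
        if (gen != seen) {
            local_invalidate();
            seen = gen;
            atomic_fetch_add(&acks, 1);
        }
    }
    return NULL;
}

/* Initiator: paging-structure entries are modified *before* the signal;
 * freed frames are reused only after every responder has acknowledged. */
static void tlb_shootdown(void)
{
    atomic_store(&acks, 0);
    atomic_fetch_add(&shootdown_gen, 1);
    while (atomic_load(&acks) < NCPUS - 1)
        ;                             /* wait for all other "CPUs"        */
    local_invalidate();               /* invalidate on the initiator too  */
}
```

The spin-wait is the user-space analogue of the barrier in a TLB shootdown routine; a real kernel would use IPIs and per-CPU acknowledgment flags.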
VPIDs provide a way for software to identify to the processor the address spaces for different “virtual processors.”
The processor may use this identification to concurrently maintain information for multiple address spaces in its
TLBs and paging-structure caches, even when non-zero PCIDs are not being used. See Section 28.1 for details.
When EPT is in use, the addresses in the paging-structures are not used as physical addresses to access memory
and memory-mapped I/O. Instead, they are treated as guest-physical addresses and are translated through a set
of EPT paging structures to produce physical addresses. EPT can also specify its own access rights and memory
typing; these are used in conjunction with those specified in this chapter. See Section 28.2 for more information.
Both VPIDs and EPT may change the way that a processor maintains information in TLBs and paging structure
caches and the ways in which software can manage that information. Some of the behaviors documented in Section
4.10 may change. See Section 28.3 for details.
One approach to combining paging and segmentation that simplifies memory-management software is to give
each segment its own page table, as shown in Figure 4-13. This convention gives the segment a single entry in the
page directory, and this entry provides the access control information for paging the entire segment.
[Figure 4-13: Each segment descriptor pairs with its own PDE in the page directory, and each such PDE points to the page table holding the PTEs that map that segment's page frames.]
12.Updates to Chapter 6, Volume 3A
Change bars and green text show changes to Chapter 6 of the Intel® 64 and IA-32 Architectures Software Devel-
oper’s Manual, Volume 3A: System Programming Guide, Part 1.
------------------------------------------------------------------------------------------
Changes to this chapter: Added new Interrupt 21—Control Protection Exception (#CP) and other Control-flow
Enforcement Technology details.
This chapter describes the interrupt and exception-handling mechanism when operating in protected mode on an
Intel 64 or IA-32 processor. Most of the information provided here also applies to interrupt and exception mecha-
nisms used in real-address, virtual-8086 mode, and 64-bit mode.
Chapter 20, “8086 Emulation,” describes information specific to interrupt and exception mechanisms in real-
address and virtual-8086 mode. Section 6.14, “Exception and Interrupt Handling in 64-bit Mode,” describes infor-
mation specific to interrupt and exception mechanisms in IA-32e mode and 64-bit sub-mode.
Vol. 3A 6-1
INTERRUPT AND EXCEPTION HANDLING
Table 6-1 shows vector number assignments for architecturally defined exceptions and for the NMI interrupt. This
table gives the exception type (see Section 6.5, “Exception Classifications”) and indicates whether an error code is
saved on the stack for the exception. The source of each predefined exception and the NMI interrupt is also given.
The processor’s local APIC is normally connected to a system-based I/O APIC. Here, external interrupts received at
the I/O APIC’s pins can be directed to the local APIC through the system bus (Pentium 4, Intel Core Duo, Intel Core
2, Intel Atom®, and Intel Xeon processors) or the APIC serial bus (P6 family and Pentium processors). The I/O APIC
determines the vector number of the interrupt and sends this number to the local APIC. When a system contains
multiple processors, processors can also send interrupts to one another by means of the system bus (Pentium 4,
Intel Core Duo, Intel Core 2, Intel Atom, and Intel Xeon processors) or the APIC serial bus (P6 family and Pentium
processors).
The LINT[1:0] pins are not available on the Intel486 processor and earlier Pentium processors that do not contain
an on-chip local APIC. These processors have dedicated NMI and INTR pins. With these processors, external inter-
rupts are typically generated by a system-based interrupt controller (8259A), with the interrupts being signaled
through the INTR pin.
Note that several other pins on the processor can cause a processor interrupt to occur. However, these interrupts
are not handled by the interrupt and exception mechanism described in this chapter. These pins include the
RESET#, FLUSH#, STPCLK#, SMI#, R/S#, and INIT# pins. Whether they are included on a particular processor is
implementation dependent. Pin functions are described in the data books for the individual processors. The SMI#
pin is described in Chapter 34, “System Management Mode.”
The IF flag in the EFLAGS register permits all maskable hardware interrupts to be masked as a group (see Section
6.8.1, “Masking Maskable Hardware Interrupts”). Note that when interrupts 0 through 15 are delivered through the
local APIC, the APIC indicates the receipt of an illegal vector.
1. The INT n instruction has opcode CD followed by an immediate byte encoding the value of n. In contrast, INT1 has opcode F1 and
INT3 has opcode CC.
See Chapter 6, “Interrupt 18—Machine-Check Exception (#MC)” and Chapter 15, “Machine-Check Architecture,”
for more information about the machine-check mechanism.
NOTE
One exception subset normally reported as a fault is not restartable. Such exceptions result in loss
of some processor state. For example, executing a POPAD instruction where the stack frame
crosses over the end of the stack segment causes a fault to be reported. In this situation, the
exception handler sees that the instruction pointer (CS:EIP) has been restored as if the POPAD
instruction had not been executed. However, internal processor state (the general-purpose
registers) will have been modified. Such cases are considered programming errors. An application
causing this class of exceptions should be terminated by the operating system.
Interrupts rigorously support restarting of interrupted programs and tasks without loss of continuity. The return
instruction pointer saved for an interrupt points to the next instruction to be executed at the instruction boundary
where the processor took the interrupt. If the instruction just executed has a repeat prefix, the interrupt is taken
at the end of the current iteration with the registers set to execute the next iteration.
The ability of a P6 family processor to speculatively execute instructions does not affect the taking of interrupts by
the processor. Interrupts are taken at instruction boundaries located during the retirement phase of instruction
execution; so they are always taken in the “in-order” instruction stream. See Chapter 2, “Intel® 64 and IA-32
Architectures,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for more informa-
tion about the P6 family processors’ microarchitecture and its support for out-of-order instruction execution.
Note that the Pentium processor and earlier IA-32 processors also perform varying amounts of prefetching and
preliminary decoding. With these processors as well, exceptions and interrupts are not signaled until actual “in-
order” execution of the instructions. For a given code sample, the signaling of exceptions occurs uniformly when
the code is executed on any family of IA-32 processors (except where new exceptions or new opcodes have been
defined).
2. The effect of the IOPL on these instructions is modified slightly when the virtual mode extension is enabled by setting the VME flag
in control register CR4: see Section 20.3, “Interrupt and Exception Handling in Virtual-8086 Mode.” Behavior is also impacted by the
PVI flag: see Section 20.4, “Protected-Mode Virtual Interrupts.”
3. Nonmaskable interrupts and system-management interrupts may also be inhibited on the instruction boundary following such an
execution of STI.
While priority among these classes listed in Table 6-2 is consistent throughout the architecture, exceptions within
each class are implementation-dependent and may vary from processor to processor. The processor first services
a pending exception or interrupt from the class which has the highest priority, transferring execution to the first
instruction of the handler. Lower priority exceptions are discarded; lower priority interrupts are held pending.
Discarded exceptions are re-generated when the interrupt handler returns execution to the point in the program or
task where the exceptions and/or interrupts occurred.
The LIDT (load IDT register) and SIDT (store IDT register) instructions load and store the contents of the IDTR
register, respectively. The LIDT instruction loads the IDTR register with the base address and limit held in a memory
operand. This instruction can be executed only when the CPL is 0. It normally is used by the initialization code of an
operating system when creating an IDT. An operating system also may use it to change from one IDT to another.
The SIDT instruction copies the base and limit value stored in IDTR to memory. This instruction can be executed at
any privilege level.
If a vector references a descriptor beyond the limit of the IDT, a general-protection exception (#GP) is generated.
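The LIDT/SIDT memory-operand layout and the limit check can be sketched as follows (a user-space illustration; the struct and helper names are invented):

```c
#include <stdint.h>
#include <stdbool.h>

/* Memory operand read/written by LIDT/SIDT: a 16-bit limit followed by
 * the IDT base (32-bit base shown; 64-bit mode uses a 64-bit base). */
struct idtr_desc {
    uint16_t limit;        /* size of the IDT in bytes, minus 1 */
    uint32_t base;         /* linear address of the IDT         */
} __attribute__((packed));

/* A vector references descriptor bytes vector*8 .. vector*8+7; if any
 * of those bytes lie beyond the limit, the reference causes #GP. */
bool idt_vector_in_limit(const struct idtr_desc *d, unsigned vector)
{
    return (uint32_t)vector * 8 + 7 <= d->limit;
}
```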
NOTE
Because interrupts are delivered to the processor core only once, an incorrectly configured IDT
could result in incomplete interrupt handling and/or the blocking of interrupt delivery.
IA-32 architecture rules need to be followed for setting up IDTR base/limit/access fields and each
field in the gate descriptors. The same rules apply to the Intel 64 architecture. This includes implicit
referencing of the destination code segment through the GDT or LDT and accessing the stack.
[Figure: Relationship of the IDTR and the IDT. The IDTR holds the 16-bit IDT limit (bits 15:0) and the IDT base address (bits 47:16). The IDT is an array of 8-byte gate descriptors: the gate for interrupt #1 at offset 0, for interrupt #2 at offset 8, for interrupt #3 at offset 16, and so on.]
[Figure: IDT gate descriptors. All three formats hold a P bit and DPL in byte 5. A task gate holds a TSS segment selector and type 00101B; an interrupt gate holds a segment selector, a 32-bit offset (bits 15:0 in bytes 1:0, bits 31:16 in bytes 7:6), and type 0D110B; a trap gate has the same layout with type 0D111B, where D indicates the size of the gate (1 = 32 bits, 0 = 16 bits).]
[Figure: Interrupt procedure call. The interrupt vector indexes an interrupt or trap gate in the IDT; the gate supplies a segment selector, which indexes a segment descriptor in the GDT or LDT to obtain the base address of the destination code segment; the gate's offset is added to that base to locate the first instruction of the interrupt procedure.]
[Figure: Stack usage on transfers to interrupt and exception-handling routines. With no privilege-level change, the processor pushes EFLAGS, CS, EIP, and (for some exceptions) an error code on the current stack. With a privilege-level change, it switches to the handler's stack and additionally pushes the interrupted procedure's SS and ESP before EFLAGS, CS, EIP, and the error code.]
To return from an exception- or interrupt-handler procedure, the handler must use the IRET (or IRETD) instruction.
The IRET instruction is similar to the RET instruction except that it restores the saved flags into the EFLAGS
register. The IOPL field of the EFLAGS register is restored only if the CPL is 0. The IF flag is changed only if the CPL
is less than or equal to the IOPL. See Chapter 3, “Instruction Set Reference, A-L,” of the Intel® 64 and IA-32 Archi-
tectures Software Developer’s Manual, Volume 2A, for a description of the complete operation performed by the
IRET instruction.
If a stack switch occurred when calling the handler procedure, the IRET instruction switches back to the interrupted
procedure’s stack on the return.
6.12.1.1 Shadow Stack Usage on Transfers to Interrupt and Exception Handling Routines
When the processor performs a call to the exception- or interrupt-handler procedure:
• If the handler procedure is going to be executed at a numerically lower privilege level, a shadow stack switch
occurs. When the shadow stack switch occurs:
a. On a transfer from privilege level 3, if shadow stacks are enabled at privilege level 3 then the SSP is saved
to the IA32_PL3_SSP MSR.
b. If shadow stacks are enabled at the privilege level where the handler will execute then the shadow stack for
the handler is obtained from one of the following MSRs based on the privilege level at which the handler
executes.
• IA32_PL2_SSP if handler executes at privilege level 2.
• IA32_PL1_SSP if handler executes at privilege level 1.
• IA32_PL0_SSP if handler executes at privilege level 0.
c. The SSP obtained is then verified to ensure it points to a valid supervisory shadow stack that is not currently
active by verifying a supervisor shadow stack token at the address pointed to by the SSP. The operations
performed to verify and acquire the supervisor shadow stack token by making it busy are as described in
Section 18.2.3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
d. On this new shadow stack, the processor pushes the CS, LIP (CS.base + EIP), and SSP of the interrupted
procedure if the interrupted procedure was executing at privilege level less than 3; see Figure 6-5.
• If the handler procedure is going to be executed at the same privilege level as the interrupted procedure and
shadow stacks are enabled at current privilege level:
a. The processor saves the current state of the CS, LIP (CS.base + EIP), and SSP registers on the current
shadow stack; see Figure 6-5.
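The MSR selection in step (b) can be sketched as a simple mapping. The MSR addresses shown are taken from the CET MSR definitions; treat them as an assumption of this sketch:

```c
#include <stdint.h>

/* Per-privilege-level shadow-stack pointer MSRs (CET). The addresses
 * below are assumed from the CET MSR definitions. */
#define IA32_PL0_SSP 0x6A4u
#define IA32_PL1_SSP 0x6A5u
#define IA32_PL2_SSP 0x6A6u
#define IA32_PL3_SSP 0x6A7u

/* Which MSR supplies the handler's shadow stack, by target privilege
 * level (step b above). PL3 is listed for completeness: it is where
 * the interrupted user-mode SSP is *saved* (step a), not loaded from. */
uint32_t handler_ssp_msr(unsigned target_pl)
{
    switch (target_pl) {
    case 0:  return IA32_PL0_SSP;
    case 1:  return IA32_PL1_SSP;
    case 2:  return IA32_PL2_SSP;
    default: return IA32_PL3_SSP;
    }
}
```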
[Figure content: With no privilege-level change, the CS, LIP, and SSP of the interrupted procedure are pushed on the current shadow stack. With a privilege-level change, the handler's supervisor shadow stack is located and verified through its supervisor shadow-stack token; on a transfer from a privilege level less than 3, the CS, LIP, and SSP are pushed on that new shadow stack.]
Figure 6-5. Shadow Stack Usage on Transfers to Interrupt and Exception-Handling Routines
To return from an exception- or interrupt-handler procedure, the handler must use the IRET (or IRETD) instruction.
When executing a return from an interrupt or exception handler from the same privilege level as the interrupted
procedure, the processor performs these actions to enforce return address protection:
• Restores the CS and EIP registers to their values prior to the interrupt or exception.
4. This check is not performed by execution of the INT1 instruction (opcode F1); it would be performed by execution of INT 1 (opcode
CD 01).
NOTE
Because IA-32 architecture tasks are not re-entrant, an interrupt-handler task must disable
interrupts between the time it completes handling the interrupt and the time it executes the IRET
instruction. This action prevents another interrupt from occurring while the interrupt task’s TSS is
still marked busy, which would cause a general-protection (#GP) exception.
[Figure: Interrupt task switching. The interrupt vector selects a task gate in the IDT; the task gate's segment selector references a TSS descriptor in the GDT, and the processor switches to that task to handle the interrupt.]
5. The bit is also set if the exception occurred during delivery of INT1.
[Figure: Error code format. Bit 0 = EXT (exception originated in an external event), bit 1 = IDT (descriptor location: 1 = IDT), bit 2 = TI (0 = GDT, 1 = LDT; used only when IDT = 0), bits 15:3 = segment selector index, bits 31:16 reserved.]
The segment selector index field provides an index into the IDT, GDT, or current LDT to the segment or gate
selector being referenced by the error code. In some cases the error code is null (all bits are clear except possibly
EXT). A null error code indicates that the error was not caused by a reference to a specific segment or that a null
segment selector was referenced in an operation.
The format of the error code is different for page-fault exceptions (#PF). See the “Interrupt 14—Page-Fault Excep-
tion (#PF)” section in this chapter.
The format of the error code is different for control protection exceptions (#CP). See the “Interrupt 21—Control
Protection Exception (#CP)” section in this chapter.
The error code is pushed on the stack as a doubleword or word (depending on the default interrupt, trap, or task
gate size). To keep the stack aligned for doubleword pushes, the upper half of the error code is reserved. Note that
the error code is not popped when the IRET instruction is executed to return from an exception handler, so the
handler must remove the error code before executing a return.
Error codes are not pushed on the stack for exceptions that are generated externally (with the INTR or LINT[1:0]
pins) or the INT n instruction, even if an error code is normally produced for those exceptions.
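A decode helper for the standard error-code format might look like this (illustrative only; the struct and function names are invented):

```c
#include <stdint.h>

/* Fields of the standard exception error code. */
struct err_code {
    unsigned ext   : 1;    /* bit 0: exception from an external event */
    unsigned idt   : 1;    /* bit 1: selector index refers to the IDT */
    unsigned ti    : 1;    /* bit 2: 0 = GDT, 1 = LDT (when idt == 0) */
    unsigned index : 13;   /* bits 15:3: segment selector index       */
};

struct err_code decode_err_code(uint32_t e)
{
    struct err_code c = {
        .ext   = e & 1,
        .idt   = (e >> 1) & 1,
        .ti    = (e >> 2) & 1,
        .index = (e >> 3) & 0x1FFF,
    };
    return c;
}
```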
[Figure: 16-byte IA-32e interrupt/trap gate. Bytes 3:0 hold the segment selector (bits 31:16) and target offset bits 15:0; bytes 7:4 hold offset bits 31:16 plus the P bit, DPL, the 4-bit type, and the 3-bit IST index; bytes 11:8 hold offset bits 63:32; bytes 15:12 are reserved.]
In 64-bit mode, the IDT index is formed by scaling the interrupt vector by 16. The first eight bytes (bytes 7:0) of a
64-bit mode interrupt gate are similar but not identical to legacy 32-bit interrupt gates. The type field (bits 11:8 in
bytes 7:4) is described in Table 3-2. The Interrupt Stack Table (IST) field (bits 2:0 in bytes 7:4) is used by the stack
switching mechanisms described in Section 6.14.5, “Interrupt Stack Table.” Bytes 11:8 hold the upper 32 bits of
the target RIP (interrupt segment offset) in canonical form. A general-protection exception (#GP) is generated if
software attempts to reference an interrupt gate with a target RIP that is not in canonical form.
The target code segment referenced by the interrupt gate must be a 64-bit code segment (CS.L = 1, CS.D = 0). If
the target is not a 64-bit code segment, a general-protection exception (#GP) is generated with the IDT vector
number reported as the error code.
Only 64-bit interrupt and trap gates can be referenced in IA-32e mode (64-bit mode and compatibility mode).
Legacy 32-bit interrupt or trap gate types (0EH or 0FH) are redefined in IA-32e mode as 64-bit interrupt and trap
gate types. No 32-bit interrupt or trap gate type exists in IA-32e mode. If a reference is made to a 16-bit interrupt
or trap gate (06H or 07H), a general-protection exception (#GP(0)) is generated.
Aligning the stack permits exception and interrupt frames to be aligned on a 16-byte boundary before interrupts
are re-enabled. This allows the stack to be formatted for optimal storage of 16-byte XMM registers, which enables
the interrupt handler to use faster 16-byte aligned loads and stores (MOVAPS rather than MOVUPS) to save and
restore XMM registers.
Although the RSP alignment is always performed when LMA = 1, it is only of consequence for the kernel-mode case
where there is no stack switch or IST used. For a stack switch or IST, the OS would have presumably put suitably
aligned RSP values in the TSS.
    Legacy 32-bit frame        64-bit frame
    +20  SS                    +40  SS
    +16  ESP                   +32  RSP
    +12  EFLAGS                +24  RFLAGS
     +8  CS                    +16  CS
     +4  EIP                    +8  RIP
      0  Error Code              0  Error Code  <- stack pointer after
                                                   transfer to handler
Figure 6-9. IA-32e Mode Stack Usage After Privilege Level Change
[Interrupt SSP Table, located by the IA32_INTERRUPT_SSP_TABLE MSR:
    Offset  Contents
      56    IST7 SSP
      48    IST6 SSP
      40    IST5 SSP
      32    IST4 SSP
      24    IST3 SSP
      16    IST2 SSP
       8    IST1 SSP
       0    Not used; available]
Description
Indicates that the divisor operand for a DIV or IDIV instruction is 0 or that the result cannot be represented in the
number of bits specified for the destination operand.
Exception Class Trap or Fault. The exception handler can distinguish between traps and faults by examining the contents of DR6 and the other debug registers.
Description
Indicates that one or more of several debug-exception conditions has been detected. Whether the exception is a
fault or a trap depends on the condition (see Table 6-3). See Chapter 17, “Debug, Branch Profile, TSC, and Intel®
Resource Director Technology (Intel® RDT) Features,” for detailed information about the debug exceptions.
and then delivers a #DB. See Section 16.3.7, “RTM-Enabled Debugger Support,” of Intel® 64 and IA-32 Architec-
tures Software Developer’s Manual, Volume 1.
Description
The nonmaskable interrupt (NMI) is generated externally by asserting the processor’s NMI pin or through an NMI
request set by the I/O APIC to the local APIC. This interrupt causes the NMI interrupt handler to be called.
Description
Indicates that a breakpoint instruction (INT3, opcode CC) was executed, causing a breakpoint trap to be gener-
ated. Typically, a debugger sets a breakpoint by replacing the first opcode byte of an instruction with the opcode for
the INT3 instruction. (The INT3 instruction is one byte long, which makes it easy to replace an opcode in a code
segment in RAM with the breakpoint opcode.) The operating system or a debugging tool can use a data segment
mapped to the same physical address space as the code segment to place an INT3 instruction in places where it is
desired to call the debugger.
With the P6 family, Pentium, Intel486, and Intel386 processors, it is more convenient to set breakpoints with the
debug registers. (See Section 17.3.2, “Breakpoint Exception (#BP)—Interrupt Vector 3,” for information about the
breakpoint exception.) If more breakpoints are needed beyond what the debug registers allow, the INT3 instruction
can be used.
Any breakpoint exception inside an RTM region causes a transactional abort and, by default, redirects control flow
to the fallback instruction address. If advanced debugging of RTM transactional regions has been enabled, any
transactional abort due to a breakpoint exception instead causes execution to roll back to just before the XBEGIN
instruction and then delivers a debug exception (#DB) — not a breakpoint exception. See Section 16.3.7, “RTM-
Enabled Debugger Support,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
A breakpoint exception can also be generated by executing the INT n instruction with an operand of 3. The action
of this instruction (INT 3) is slightly different than that of the INT3 instruction (see “INT n/INTO/INT3/INT1—Call to
Interrupt Procedure” in Chapter 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume
2A).
Description
Indicates that an overflow trap occurred when an INTO instruction was executed. The INTO instruction checks the
state of the OF flag in the EFLAGS register. If the OF flag is set, an overflow trap is generated.
Some arithmetic instructions (such as the ADD and SUB) perform both signed and unsigned arithmetic. These
instructions set the OF and CF flags in the EFLAGS register to indicate signed overflow and unsigned overflow,
respectively. When performing arithmetic on signed operands, the OF flag can be tested directly or the INTO
instruction can be used. The benefit of using the INTO instruction is that if the overflow exception is detected, an
exception handler can be called automatically to handle the overflow condition.
Description
Indicates that a BOUND-range-exceeded fault occurred when a BOUND instruction was executed. The BOUND
instruction checks that a signed array index is within the upper and lower bounds of an array located in memory. If
the array index is not within the bounds of the array, a BOUND-range-exceeded fault is generated.
Description
Indicates that the processor did one of the following things:
• Attempted to execute an invalid or reserved opcode.
• Attempted to execute an instruction with an operand type that is invalid for its accompanying opcode; for
example, the source operand for a LES instruction is not a memory location.
• Attempted to execute an MMX or SSE/SSE2/SSE3 instruction on an Intel 64 or IA-32 processor that does not
support the MMX technology or SSE/SSE2/SSE3/SSSE3 extensions, respectively. CPUID feature flags MMX (bit
23), SSE (bit 25), SSE2 (bit 26), SSE3 (ECX, bit 0), SSSE3 (ECX, bit 9) indicate support for these extensions.
• Attempted to execute an MMX instruction or SSE/SSE2/SSE3/SSSE3 SIMD instruction (with the exception of
the MOVNTI, PAUSE, PREFETCHh, SFENCE, LFENCE, MFENCE, CLFLUSH, MONITOR, and MWAIT instructions)
when the EM flag in control register CR0 is set (1).
• Attempted to execute an SSE/SSE2/SSE3/SSSE3 instruction when the OSFXSR bit in control register CR4 is
clear (0). Note this does not include the following SSE/SSE2/SSE3 instructions: MASKMOVQ, MOVNTQ,
MOVNTI, PREFETCHh, SFENCE, LFENCE, MFENCE, and CLFLUSH; or the 64-bit versions of the PAVGB, PAVGW,
PEXTRW, PINSRW, PMAXSW, PMAXUB, PMINSW, PMINUB, PMOVMSKB, PMULHUW, PSADBW, PSHUFW, PADDQ,
PSUBQ, PALIGNR, PABSB, PABSD, PABSW, PHADDD, PHADDSW, PHADDW, PHSUBD, PHSUBSW, PHSUBW,
PMADDUBSW, PMULHRSW, PSHUFB, PSIGNB, PSIGND, and PSIGNW.
• Attempted to execute an SSE/SSE2/SSE3/SSSE3 instruction on an Intel 64 or IA-32 processor that caused a
SIMD floating-point exception when the OSXMMEXCPT bit in control register CR4 is clear (0).
• Executed a UD0, UD1 or UD2 instruction. Note that even though it is the execution of the UD0, UD1 or UD2
instruction that causes the invalid opcode exception, the saved instruction pointer still points to the UD0,
UD1 or UD2 instruction.
• Detected a LOCK prefix that precedes an instruction that may not be locked or one that may be locked but the
destination operand is not a memory location.
• Attempted to execute an LLDT, SLDT, LTR, STR, LSL, LAR, VERR, VERW, or ARPL instruction while in real-
address or virtual-8086 mode.
• Attempted to execute the RSM instruction when not in SMM mode.
In Intel 64 and IA-32 processors that implement out-of-order execution microarchitectures, this exception is not
generated until an attempt is made to retire the result of executing an invalid instruction; that is, decoding and
speculatively attempting to execute an invalid opcode does not generate this exception. Likewise, in the Pentium
processor and earlier IA-32 processors, this exception is not generated as the result of prefetching and preliminary
decoding of an invalid instruction. (See Section 6.5, “Exception Classifications,” for general rules for taking of inter-
rupts and exceptions.)
The opcodes D6 and F1 are undefined opcodes reserved by the Intel 64 and IA-32 architectures. These opcodes,
even though undefined, do not generate an invalid opcode exception.
Description
The device-not-available exception is generated by any one of three conditions:
• The processor executed an x87 FPU floating-point instruction while the EM flag in control register CR0 was set
(1). See the paragraph below for the special case of the WAIT/FWAIT instruction.
• The processor executed a WAIT/FWAIT instruction while the MP and TS flags of register CR0 were set,
regardless of the setting of the EM flag.
• The processor executed an x87 FPU, MMX, or SSE/SSE2/SSE3 instruction (with the exception of MOVNTI, PAUSE, PREFETCHh, SFENCE, LFENCE, MFENCE, and CLFLUSH) while the TS flag in control register CR0 was set and the EM flag was clear.
The EM flag is set when the processor does not have an internal x87 FPU floating-point unit. A device-not-available
exception is then generated each time an x87 FPU floating-point instruction is encountered, allowing an exception
handler to call floating-point instruction emulation routines.
The TS flag indicates that a context switch (task switch) has occurred since the last time an x87 floating-point,
MMX, or SSE/SSE2/SSE3 instruction was executed; but that the context of the x87 FPU, XMM, and MXCSR registers
were not saved. When the TS flag is set and the EM flag is clear, the processor generates a device-not-available
exception each time an x87 floating-point, MMX, or SSE/SSE2/SSE3 instruction is encountered (with the exception
of the instructions listed above). The exception handler can then save the context of the x87 FPU, XMM, and MXCSR
registers before it executes the instruction. See Section 2.5, “Control Registers,” for more information about the TS
flag.
The MP flag in control register CR0 is used along with the TS flag to determine if WAIT or FWAIT instructions should
generate a device-not-available exception. It extends the function of the TS flag to the WAIT and FWAIT instruc-
tions, giving the exception handler an opportunity to save the context of the x87 FPU before the WAIT or FWAIT
instruction is executed. The MP flag is provided primarily for use with the Intel 286 and Intel386 DX processors. For
programs running on the Pentium 4, Intel Xeon, P6 family, Pentium, or Intel486 DX processors, or the Intel 487 SX
coprocessors, the MP flag should always be set; for programs running on the Intel486 SX processor, the MP flag
should be clear.
Description
Indicates that the processor detected a second exception while calling an exception handler for a prior exception.
Normally, when the processor detects another exception while trying to call an exception handler, the two excep-
tions can be handled serially. If, however, the processor cannot handle them serially, it signals the double-fault
exception. To determine when two faults need to be signalled as a double fault, the processor divides the excep-
tions into three classes: benign exceptions, contributory exceptions, and page faults (see Table 6-4).
Table 6-5 shows the various combinations of exception classes that cause a double fault to be generated. A double-
fault exception falls in the abort class of exceptions. The program or task cannot be restarted or resumed. The
double-fault handler can be used to collect diagnostic information about the state of the machine and/or, when
possible, to shut the application and/or system down gracefully or restart the system.
A segment or page fault may be encountered while prefetching instructions; however, this behavior is outside the
domain of Table 6-5. Any further faults generated while the processor is attempting to transfer control to the appro-
priate fault handler could still lead to a double-fault sequence.
First Exception   Second: Benign              Second: Contributory        Second: Page Fault
Benign            Handle Exceptions Serially  Handle Exceptions Serially  Handle Exceptions Serially
Contributory      Handle Exceptions Serially  Generate a Double Fault     Handle Exceptions Serially
Page Fault        Handle Exceptions Serially  Generate a Double Fault     Generate a Double Fault
Double Fault      Handle Exceptions Serially  Enter Shutdown Mode         Enter Shutdown Mode
If another contributory or page fault exception occurs while attempting to call the double-fault handler, the
processor enters shutdown mode. This mode is similar to the state following execution of an HLT instruction. In this
mode, the processor stops executing instructions until an NMI interrupt, SMI interrupt, hardware reset, or INIT# is
received. The processor generates a special bus cycle to indicate that it has entered shutdown mode. Software
designers may need to be aware of the response of hardware when it goes into shutdown mode. For example, hard-
ware may turn on an indicator light on the front panel, generate an NMI interrupt to record diagnostic information,
invoke reset initialization, generate an INIT initialization, or generate an SMI. If any events are pending during
shutdown, they will be handled after a wake event from shutdown is processed (for example, A20M# interrupts).
If a shutdown occurs while the processor is executing an NMI interrupt handler, then only a hardware reset can
restart the processor. Likewise, if the shutdown occurs while executing in SMM, a hardware reset must be used to
restart the processor.
Exception Class Abort. (Intel reserved; do not use. Recent IA-32 processors do not generate this
exception.)
Description
Indicates that an Intel386 CPU-based system with an Intel 387 math coprocessor detected a page or segment
violation while transferring the middle portion of an Intel 387 math coprocessor operand. The P6 family, Pentium,
and Intel486 processors do not generate this exception; instead, this condition is detected with a general protec-
tion exception (#GP), interrupt 13.
Description
Indicates that there was an error related to a TSS. Such an error might be detected during a task switch or during
the execution of instructions that use information from a TSS. Table 6-6 shows the conditions that cause an invalid
TSS exception to be generated.
This exception can be generated either in the context of the original task or in the context of the new task (see Section
7.3, “Task Switching”). Until the processor has completely verified the presence of the new TSS, the exception is
generated in the context of the original task. Once the existence of the new TSS is verified, the task switch is
considered complete. Any invalid-TSS conditions detected after this point are handled in the context of the new
task. (A task switch is considered complete when the task register is loaded with the segment selector for the new
TSS and, if the switch is due to a procedure call or interrupt, the previous task link field of the new TSS references
the old TSS.)
The invalid-TSS handler must be a task called using a task gate. Handling this exception inside the faulting TSS
context is not recommended because the processor state may not be consistent.
Description
Indicates that the present flag of a segment or gate descriptor is clear. The processor can generate this exception
during any of the following operations:
• While attempting to load CS, DS, ES, FS, or GS registers. [Detection of a not-present segment while loading the
SS register causes a stack fault exception (#SS) to be generated.] This situation can occur while performing a
task switch.
• While attempting to load the LDTR using an LLDT instruction. Detection of a not-present LDT while loading the
LDTR during a task switch operation causes an invalid-TSS exception (#TS) to be generated.
• When executing the LTR instruction and the TSS is marked not present.
• While attempting to use a gate descriptor or TSS that is marked segment-not-present, but is otherwise valid.
An operating system typically uses the segment-not-present exception to implement virtual memory at the
segment level. If the exception handler loads the segment and returns, the interrupted program or task resumes
execution.
A not-present indication in a gate descriptor, however, does not indicate that a segment is not present (because
gates do not correspond to segments). The operating system may use the present flag for gate descriptors to
trigger exceptions of special significance to the operating system.
A contributory exception or page fault that subsequently referenced a not-present segment would cause a double
fault (#DF) to be generated instead of #NP.
occurs. If it occurs after the commit point, the processor will load all the state information from the new TSS
(without performing any additional limit, present, or type checks) before it generates the exception. The segment-
not-present exception handler should not rely on being able to use the segment selectors found in the CS, SS, DS,
ES, FS, and GS registers without causing another exception. (See the Program State Change description for “Inter-
rupt 10—Invalid TSS Exception (#TS)” in this chapter for additional information on how to handle this situation.)
Description
Indicates that one of the following stack related conditions was detected:
• A limit violation is detected during an operation that refers to the SS register. Operations that can cause a limit
violation include stack-oriented instructions such as POP, PUSH, CALL, RET, IRET, ENTER, and LEAVE, as well as
other memory references which implicitly or explicitly use the SS register (for example, MOV AX, [BP+6] or
MOV AX, SS:[EAX+6]). The ENTER instruction generates this exception when there is not enough stack space
for allocating local variables.
• A not-present stack segment is detected when attempting to load the SS register. This violation can occur
during the execution of a task switch, a CALL instruction to a different privilege level, a return to a different
privilege level, an LSS instruction, or a MOV or POP instruction to the SS register.
• A canonical violation is detected in 64-bit mode during an operation that references memory using a stack-pointer register that contains a non-canonical memory address.
Recovery from this fault is possible by either extending the limit of the stack segment (in the case of a limit violation) or loading the missing stack segment into memory (in the case of a not-present violation).
In the case of a canonical violation that was caused intentionally by software, recovery is possible by loading the
correct canonical value into RSP. Otherwise, a canonical violation of the address in RSP likely reflects some register
corruption in the software.
Description
Indicates that the processor detected one of a class of protection violations called “general-protection violations.”
The conditions that cause this exception to be generated comprise all the protection violations that do not cause
other exceptions to be generated (such as, invalid-TSS, segment-not-present, stack-fault, or page-fault excep-
tions). The following conditions cause general-protection exceptions to be generated:
• Exceeding the segment limit when accessing the CS, DS, ES, FS, or GS segments.
• Exceeding the segment limit when referencing a descriptor table (except during a task switch or a stack
switch).
• Transferring execution to a segment that is not executable.
• Writing to a code segment or a read-only data segment.
• Reading from an execute-only code segment.
• Loading the SS register with a segment selector for a read-only segment (unless the selector comes from a TSS
during a task switch, in which case an invalid-TSS exception occurs).
• Loading the SS, DS, ES, FS, or GS register with a segment selector for a system segment.
• Loading the DS, ES, FS, or GS register with a segment selector for an execute-only code segment.
• Loading the SS register with the segment selector of an executable segment or a null segment selector.
• Loading the CS register with a segment selector for a data segment or a null segment selector.
• Accessing memory using the DS, ES, FS, or GS register when it contains a null segment selector.
• Switching to a busy task during a call or jump to a TSS.
• Using a segment selector on a non-IRET task switch that points to a TSS descriptor in the current LDT. TSS
descriptors can only reside in the GDT. This condition causes a #TS exception during an IRET task switch.
• Violating any of the privilege rules described in Chapter 5, “Protection.”
• Exceeding the instruction length limit of 15 bytes (this only can occur when redundant prefixes are placed
before an instruction).
• Loading the CR0 register with a set PG flag (paging enabled) and a clear PE flag (protection disabled).
• Loading the CR0 register with a set NW flag and a clear CD flag.
• Referencing an entry in the IDT (following an interrupt or exception) that is not an interrupt, trap, or task gate.
• Attempting to access an interrupt or exception handler through an interrupt or trap gate from virtual-8086
mode when the handler’s code segment DPL is greater than 0.
• Attempting to write a 1 into a reserved bit of CR4.
• Attempting to execute a privileged instruction when the CPL is not equal to 0 (see Section 5.9, “Privileged
Instructions,” for a list of privileged instructions).
• Attempting to execute SGDT, SIDT, SLDT, SMSW, or STR when CR4.UMIP = 1 and the CPL is not equal to 0.
• Writing to a reserved bit in an MSR.
• Accessing a gate that contains a null segment selector.
• Executing the INT n instruction when the CPL is greater than the DPL of the referenced interrupt, trap, or task
gate.
• The segment selector in a call, interrupt, or trap gate does not point to a code segment.
• The segment selector operand in the LLDT instruction is a local type (TI flag is set) or does not point to a
segment descriptor of the LDT type.
• The segment selector operand in the LTR instruction is local or points to a TSS that is not available.
• The target code-segment selector for a call, jump, or return is null.
• If the PAE and/or PSE flag in control register CR4 is set and the processor detects any reserved bits in a page-
directory-pointer-table entry set to 1. These bits are checked during a write to control registers CR0, CR3, or
CR4 that causes a reloading of the page-directory-pointer-table entry.
• Attempting to write a non-zero value into the reserved bits of the MXCSR register.
• Executing an SSE/SSE2/SSE3 instruction that attempts to access a 128-bit memory location that is not aligned
on a 16-byte boundary when the instruction requires 16-byte alignment. This condition also applies to the stack
segment.
A program or task can be restarted following any general-protection exception. If the exception occurs while
attempting to call an interrupt handler, the interrupted program can be restartable, but the interrupt may be lost.
• If the segment descriptor pointed to by the segment selector in the destination operand is a code segment and
it has both the D-bit and the L-bit set.
• If the segment descriptor from a 64-bit call gate is in non-canonical space.
• If the DPL from a 64-bit call-gate is less than the CPL or than the RPL of the 64-bit call-gate.
• If the type field of the upper 64 bits of a 64-bit call gate is not 0.
• If an attempt is made to load a null selector in the SS register in compatibility mode.
• If an attempt is made to load a null selector in the SS register at CPL 3 in 64-bit mode.
• If an attempt is made to load a null selector in the SS register in non-CPL3 and 64-bit mode where RPL is not
equal to CPL.
• If an attempt is made to clear CR0.PG while IA-32e mode is enabled.
• If an attempt is made to set a reserved bit in CR3, CR4 or CR8.
Description
Indicates that, with paging enabled (the PG flag in the CR0 register is set), the processor detected one of the
following conditions while using the page-translation mechanism to translate a linear address to a physical
address:
• The P (present) flag in a page-directory or page-table entry needed for the address translation is clear,
indicating that a page table or the page containing the operand is not present in physical memory.
• The procedure does not have sufficient privilege to access the indicated page (that is, a procedure running in
user mode attempts to access a supervisor-mode page). If the SMAP flag is set in CR4, a page fault may also
be triggered by code running in supervisor mode that tries to access data at a user-mode address. If the PKE
flag is set in CR4, the PKRU register may cause page faults on data accesses to user-mode addresses with
certain protection keys.
• Code running in user mode attempts to write to a read-only page. If the WP flag is set in CR0, the page fault
will also be triggered by code running in supervisor mode that tries to write to a read-only page.
• An instruction fetch to a linear address that translates to a physical address in a memory page with the
execute-disable bit set (for information about the execute-disable bit, see Chapter 4, “Paging”). If the SMEP
flag is set in CR4, a page fault will also be triggered by code running in supervisor mode that tries to fetch an
instruction from a user-mode address.
• One or more reserved bits in a paging-structure entry are set to 1. See the description below of the RSVD error code flag.
• A shadow-stack access is made to a page that is not a shadow-stack page. See Section 18.2, “Shadow Stacks”
in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 and Chapter 4.6, “Access
Rights.”
• An enclave access violates one of the specified access-control requirements. See Section 37.3, “Access-control
Requirements” and Section 37.20, “Enclave Page Cache Map (EPCM)” in Chapter 37, “Enclave Access Control
and Data Structures.” In this case, the exception is called an SGX-induced page fault. The processor uses the
error code (below) to distinguish SGX-induced page faults from ordinary page faults.
The exception handler can recover from page-not-present conditions and restart the program or task without any
loss of program continuity. It can also restart the program or task after a privilege violation, but the problem that
caused the privilege violation may be uncorrectable.
See also: Section 4.7, “Page-Fault Exceptions.”
[Page-Fault Error Code: bit 0 = P, bit 1 = W/R, bit 2 = U/S, bit 3 = RSVD,
bit 4 = I/D, bit 5 = PK, bit 6 = SS, bit 15 = SGX; all other bits reserved.]
• The contents of the CR2 register. The processor loads the CR2 register with the 32-bit linear address that
generated the exception. The page-fault handler can use this address to locate the corresponding page
directory and page-table entries. Another page fault can potentially occur during execution of the page-fault
handler; the handler should save the contents of the CR2 register before a second page fault can occur.6 If a
page fault is caused by a page-level protection violation, the access flag in the page-directory entry is set when
the fault occurs. The behavior of IA-32 processors regarding the access flag in the corresponding page-table
entry is model specific and not architecturally defined.
6. Processors update CR2 whenever a page fault is detected. If a second page fault occurs while an earlier page fault is being deliv-
ered, the faulting linear address of the second fault will overwrite the contents of CR2 (replacing the previous address). These
updates to CR2 occur even if the page fault results in a double fault or occurs during the delivery of a double fault.
Description
Indicates that the x87 FPU has detected a floating-point error. The NE flag in the register CR0 must be set for an
interrupt 16 (floating-point error exception) to be generated. (See Section 2.5, “Control Registers,” for a detailed
description of the NE flag.)
NOTE
SIMD floating-point exceptions (#XM) are signaled through interrupt 19.
While executing x87 FPU instructions, the x87 FPU detects and reports six types of floating-point error conditions:
• Invalid operation (#I)
— Stack overflow or underflow (#IS)
— Invalid arithmetic operation (#IA)
• Divide-by-zero (#Z)
• Denormalized operand (#D)
• Numeric overflow (#O)
• Numeric underflow (#U)
• Inexact result (precision) (#P)
Each of these error conditions represents an x87 FPU exception type, and for each exception type, the x87 FPU
provides a flag in the x87 FPU status register and a mask bit in the x87 FPU control register. If the x87 FPU detects
a floating-point error and the mask bit for the exception type is set, the x87 FPU handles the exception automati-
cally by generating a predefined (default) response and continuing program execution. The default responses have
been designed to provide a reasonable result for most floating-point applications.
If the mask for the exception is clear and the NE flag in register CR0 is set, the x87 FPU does the following:
1. Sets the necessary flag in the FPU status register.
2. Waits until the next “waiting” x87 FPU instruction or WAIT/FWAIT instruction is encountered in the program’s
instruction stream.
3. Generates an internal error signal that causes the processor to generate a floating-point exception (#MF).
Prior to executing a waiting x87 FPU instruction or the WAIT/FWAIT instruction, the x87 FPU checks for pending
x87 FPU floating-point exceptions (as described in step 2 above). Pending x87 FPU floating-point exceptions are
ignored for “non-waiting” x87 FPU instructions, which include the FNINIT, FNCLEX, FNSTSW, FNSTSW AX, FNSTCW,
FNSTENV, and FNSAVE instructions. Pending x87 FPU exceptions are also ignored when executing the state
management instructions FXSAVE and FXRSTOR.
All of the x87 FPU floating-point error conditions can be recovered from. The x87 FPU floating-point-error exception
handler can determine the error condition that caused the exception from the settings of the flags in the x87 FPU
status word. See “Software Exception Handling” in Chapter 8 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1, for more information on handling x87 FPU floating-point exceptions.
register. See Section 8.1.8, “x87 FPU Instruction and Data (Operand) Pointers” in Chapter 8 of the Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 1, for more information about information the FPU saves
for use in handling floating-point-error exceptions.
Description
Indicates that the processor detected an unaligned memory operand when alignment checking was enabled. Align-
ment checks are only carried out in data (or stack) accesses (not in code fetches or system segment accesses). An
example of an alignment-check violation is a word stored at an odd byte address, or a doubleword stored at an
address that is not an integer multiple of 4. Table 6-7 lists the alignment requirements of various data types recognized by the processor.
Note that the alignment check exception (#AC) is generated only for data types that must be aligned on word,
doubleword, and quadword boundaries. A general-protection exception (#GP) is generated for 128-bit data types that are not aligned on a 16-byte boundary.
To enable alignment checking, the following conditions must be true:
• AM flag in CR0 register is set.
• AC flag in the EFLAGS register is set.
• The CPL is 3 (including virtual-8086 mode).
Alignment-check exceptions (#AC) are generated only when operating at privilege level 3 (user mode). Memory
references that default to privilege level 0, such as segment descriptor loads, do not generate alignment-check
exceptions, even when caused by a memory reference made from privilege level 3.
Storing the contents of the GDTR, IDTR, LDTR, or task register in memory while at privilege level 3 can generate
an alignment-check exception. Although application programs do not normally store these registers, the fault can be avoided by aligning the stored information on an even word address.
The FXSAVE/XSAVE and FXRSTOR/XRSTOR instructions save and restore a 512-byte data structure, the first byte
of which must be aligned on a 16-byte boundary. If the alignment-check exception (#AC) is enabled when
executing these instructions (and CPL is 3), a misaligned memory operand can cause either an alignment-check
exception or a general-protection exception (#GP) depending on the processor implementation (see “FXSAVE-
Save x87 FPU, MMX, SSE, and SSE2 State” and “FXRSTOR-Restore x87 FPU, MMX, SSE, and SSE2 State” in
Chapter 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A; see “XSAVE—Save
Processor Extended States” and “XRSTOR—Restore Processor Extended States” in Chapter 5 of the Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 2C).
The MOVDQU, MOVUPS, and MOVUPD instructions perform 128-bit unaligned loads or stores. The LDDQU instruction loads 128 bits of unaligned data. These instructions do not generate general-protection exceptions (#GP) when operands are
not aligned on a 16-byte boundary. If alignment checking is enabled, alignment-check exceptions (#AC) may or
may not be generated depending on processor implementation when data addresses are not aligned on an 8-byte
boundary.
FSAVE and FRSTOR instructions can generate unaligned references, which can cause alignment-check faults. These
instructions are rarely needed by application programs.
Interrupt 18—Machine-Check Exception (#MC)

Description
Indicates that the processor detected an internal machine error or a bus error, or that an external agent detected
a bus error. The machine-check exception is model-specific, available on the Pentium and later generations of
processors. The implementation of the machine-check exception is different between different processor families,
and these implementations may not be compatible with future Intel 64 or IA-32 processors. (Use the CPUID
instruction to determine whether this feature is present.)
Bus errors detected by external agents are signaled to the processor on dedicated pins: the BINIT# and MCERR#
pins on the Pentium 4, Intel Xeon, and P6 family processors and the BUSCHK# pin on the Pentium processor. When
one of these pins is enabled, asserting the pin causes error information to be loaded into machine-check registers
and a machine-check exception is generated.
The machine-check exception and machine-check architecture are discussed in detail in Chapter 15, “Machine-
Check Architecture.” Also, see the data books for the individual processors for processor-specific hardware infor-
mation.
Interrupt 19—SIMD Floating-Point Exception (#XM)

Description
Indicates the processor has detected an SSE/SSE2/SSE3 SIMD floating-point exception. The appropriate status
flag in the MXCSR register must be set and the particular exception unmasked for this interrupt to be generated.
There are six classes of numeric exception conditions that can occur while executing an SSE/SSE2/SSE3 SIMD
floating-point instruction:
• Invalid operation (#I)
• Divide-by-zero (#Z)
• Denormal operand (#D)
• Numeric overflow (#O)
• Numeric underflow (#U)
• Inexact result (Precision) (#P)
The invalid operation, divide-by-zero, and denormal-operand exceptions are pre-computation exceptions; that is,
they are detected before any arithmetic operation occurs. The numeric underflow, numeric overflow, and inexact
result exceptions are post-computational exceptions.
See “SIMD Floating-Point Exceptions” in Chapter 11 of the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 1, for additional information about the SIMD floating-point exception classes.
When a SIMD floating-point exception occurs, the processor does either of the following things:
• It handles the exception automatically by producing the most reasonable result and allowing program
execution to continue undisturbed. This is the response to masked exceptions.
• It generates a SIMD floating-point exception, which in turn invokes a software exception handler. This is the
response to unmasked exceptions.
Each of the six SIMD floating-point exception conditions has a corresponding flag bit and mask bit in the MXCSR
register. If an exception is masked (the corresponding mask bit in the MXCSR register is set), the processor takes
an appropriate automatic default action and continues with the computation. If the exception is unmasked (the
corresponding mask bit is clear) and the operating system supports SIMD floating-point exceptions (the OSXM-
MEXCPT flag in control register CR4 is set), a software exception handler is invoked through a SIMD floating-point
exception. If the exception is unmasked and the OSXMMEXCPT bit is clear (indicating that the operating system
does not support unmasked SIMD floating-point exceptions), an invalid opcode exception (#UD) is signaled instead
of a SIMD floating-point exception.
Note that because SIMD floating-point exceptions are precise and occur immediately, the situation does not arise
where an x87 FPU instruction, a WAIT/FWAIT instruction, or another SSE/SSE2/SSE3 instruction will catch a
pending unmasked SIMD floating-point exception.
If a SIMD floating-point exception occurs while SIMD floating-point exceptions are masked (causing the corresponding exception flag to be set) and the exception is subsequently unmasked, no exception is generated at the time of unmasking.
When SSE/SSE2/SSE3 SIMD floating-point instructions operate on packed operands (made up of two or four sub-
operands), multiple SIMD floating-point exception conditions may be detected. If no more than one exception
condition is detected for one or more sets of sub-operands, the exception flags are set for each exception condition
detected. For example, an invalid exception detected for one sub-operand will not prevent the reporting of a divide-
by-zero exception for another sub-operand. However, when two or more exception conditions are generated for
one sub-operand, only one exception condition is reported, according to the precedences shown in Table 6-8. This
exception precedence sometimes results in the higher priority exception condition being reported and the lower
priority exception conditions being ignored.
Interrupt 20—Virtualization Exception (#VE)

Description
Indicates that the processor detected an EPT violation in VMX non-root operation. Not all EPT violations cause virtu-
alization exceptions. See Section 25.5.6.2 for details.
The exception handler can recover from EPT violations and restart the program or task without any loss of program
continuity. In some cases, however, the problem that caused the EPT violation may be uncorrectable.
Interrupt 21—Control Protection Exception (#CP)

Description
Indicates a control flow transfer attempt violated the control flow enforcement technology constraints.
[Figure: #CP error code — bits 14:0: control protection exception code (CPEC); bit 15: ENCL; bits 31:16: reserved]
In these cases the exception occurs in the context of the new task. The instruction pointer refers to the first instruc-
tion of the new task, not to the instruction which caused the task switch (or the last instruction to be executed, in
the case of an interrupt). If the design of the operating system permits control protection faults to occur during
task-switches, the control protection fault handler should be called through a task gate.
Interrupts 32 to 255—User Defined Interrupts

Description
Indicates that the processor did one of the following things:
• Executed an INT n instruction where the instruction operand is one of the vector numbers from 32 through 255.
• Responded to an interrupt request at the INTR pin or from the local APIC when the interrupt vector number
associated with the request is from 32 through 255.
13.Updates to Chapter 7, Volume 3A
Change bars and green text show changes to Chapter 7 of the Intel® 64 and IA-32 Architectures Software Devel-
oper’s Manual, Volume 3A: System Programming Guide, Part 1.
------------------------------------------------------------------------------------------
Changes to this chapter: Added Control-flow Enforcement Technology task management details.
This chapter describes the IA-32 architecture’s task management facilities. These facilities are only available when
the processor is running in protected mode.
This chapter focuses on 32-bit tasks and the 32-bit TSS structure. For information on 16-bit tasks and the 16-bit
TSS structure, see Section 7.6, “16-Bit Task-State Segment (TSS).” For information specific to task management in
64-bit mode, see Section 7.7, “Task Management in 64-bit Mode.”
Vol. 3A 7-1
TASK MANAGEMENT
[Figure: structure of a task — a task comprises a code segment, a data segment, and stack segments (one for the current privilege level and one each for privilege levels 0, 1, and 2), together with the task-state segment (TSS) referenced by the task register and the page-directory base in CR3]
to handle an interrupt or exception, the IDT entry for the interrupt or exception must contain a task gate that holds
the selector for the interrupt- or exception-handler TSS.
When a task is dispatched for execution, a task switch occurs between the currently running task and the
dispatched task. During a task switch, the execution environment of the currently executing task (called the task’s
state or context) is saved in its TSS and execution of the task is suspended. The context for the dispatched task is
then loaded into the processor and execution of that task begins with the instruction pointed to by the newly loaded
EIP register. If the task has not been run since the system was last initialized, the EIP will point to the first instruc-
tion of the task’s code; otherwise, it will point to the next instruction after the last instruction that the task
executed when it was last active.
If the currently executing task (the calling task) called the task being dispatched (the called task), the TSS
segment selector for the calling task is stored in the TSS of the called task to provide a link back to the calling task.
For all IA-32 processors, tasks are not recursive. A task cannot call or jump to itself.
Interrupts and exceptions can be handled with a task switch to a handler task. Here, the processor performs a task
switch to handle the interrupt or exception and automatically switches back to the interrupted task upon returning
from the interrupt-handler task or exception-handler task. This mechanism can also handle interrupts that occur
during interrupt tasks.
As part of a task switch, the processor can also switch to another LDT, allowing each task to have a different logical-
to-physical address mapping for LDT-based segments. The page-directory base register (CR3) also is reloaded on a
task switch, allowing each task to have its own set of page tables. These protection facilities help isolate tasks and
prevent them from interfering with one another.
If protection mechanisms are not used, the processor provides no protection between tasks. This is true even with
operating systems that use multiple privilege levels for protection. A task running at privilege level 3 that uses the
same LDT and page tables as other privilege-level-3 tasks can access code and corrupt data and the stack of other
tasks.
Use of task management facilities for handling multitasking applications is optional. Multitasking can be handled in
software, with each software-defined task executed in the context of a single IA-32 architecture task.
If shadow stack is enabled, then the SSP of the task is located at the 4 bytes at offset 104 in the 32-bit TSS and is
used by the processor to establish the SSP when a task switch occurs from a task associated with this TSS. Note
that the processor does not write the SSP of the task initiating the task switch to the TSS of that task, and instead
the SSP of the previous task is pushed onto the shadow stack of the new task.
[Figure: 32-bit Task-State Segment (TSS); byte offsets at right. In rows with a Reserved field, the named field occupies the low 16 bits and the upper 16 bits are reserved.]
SSP 104
I/O Map Base Address | Reserved | T 100
Reserved | LDT Segment Selector 96
Reserved | GS 92
Reserved | FS 88
Reserved | DS 84
Reserved | SS 80
Reserved | CS 76
Reserved | ES 72
EDI 68
ESI 64
EBP 60
ESP 56
EBX 52
EDX 48
ECX 44
EAX 40
EFLAGS 36
EIP 32
CR3 (PDBR) 28
Reserved | SS2 24
ESP2 20
Reserved | SS1 16
ESP1 12
Reserved | SS0 8
ESP0 4
Reserved | Previous Task Link 0
The processor updates dynamic fields when a task is suspended during a task switch. The following are dynamic
fields:
• General-purpose register fields — State of the EAX, ECX, EDX, EBX, ESP, EBP, ESI, and EDI registers prior
to the task switch.
• Segment selector fields — Segment selectors stored in the ES, CS, SS, DS, FS, and GS registers prior to the
task switch.
• EFLAGS register field — State of the EFLAGS register prior to the task switch.
• EIP (instruction pointer) field — State of the EIP register prior to the task switch.
• Previous task link field — Contains the segment selector for the TSS of the previous task (updated on a task
switch that was initiated by a call, interrupt, or exception). This field (which is sometimes called the back link
field) permits a task switch back to the previous task by using the IRET instruction.
The processor reads the static fields, but does not normally change them. These fields are set up when a task is
created. The following are static fields:
• LDT segment selector field — Contains the segment selector for the task's LDT.
• CR3 control register field — Contains the base physical address of the page directory to be used by the task.
Control register CR3 is also known as the page-directory base register (PDBR).
• Privilege level-0, -1, and -2 stack pointer fields — These stack pointers consist of a logical address made
up of the segment selector for the stack segment (SS0, SS1, and SS2) and an offset into the stack (ESP0,
ESP1, and ESP2). Note that the values in these fields are static for a particular task; whereas, the SS and ESP
values will change if stack switching occurs within the task.
• T (debug trap) flag (byte 100, bit 0) — When set, the T flag causes the processor to raise a debug exception
when a task switch to this task occurs (see Section 17.3.1.5, “Task-Switch Exception Condition”).
• I/O map base address field — Contains a 16-bit offset from the base of the TSS to the I/O permission bit
map and interrupt redirection bitmap. When present, these maps are stored in the TSS at higher addresses.
The I/O map base address points to the beginning of the I/O permission bit map and the end of the interrupt
redirection bit map. See Chapter 18, “Input/Output,” in the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1, for more information about the I/O permission bit map. See Section 20.3,
“Interrupt and Exception Handling in Virtual-8086 Mode,” for a detailed description of the interrupt redirection
bit map.
• Shadow Stack Pointer (SSP) — Contains the task's shadow stack pointer. The shadow stack of the task should
have a supervisor shadow stack token at the address pointed to by the task SSP (offset 104). This token will be
verified and made busy when switching to that shadow stack using a CALL/JMP instruction, and made free
when switching out of that task using an IRET instruction.
If paging is used:
• Avoid placing a page boundary in the part of the TSS that the processor reads during a task switch (the first 104
bytes). The processor may not correctly perform address translations if a boundary occurs in this area. During
a task switch, the processor reads and writes into the first 104 bytes of each TSS (using contiguous physical
addresses beginning with the physical address of the first byte of the TSS). So, after TSS access begins, if part
of the 104 bytes is not physically contiguous, the processor will access incorrect information without generating
a page-fault exception.
• Pages corresponding to the previous task’s TSS, the current task’s TSS, and the descriptor table entries for
each all should be marked as read/write.
• Task switches are carried out faster if the pages containing these structures are present in memory before the
task switch is initiated.
TSS Descriptor
[Figure: TSS descriptor format — high doubleword (bytes 4-7): Base 31:24 | G | 0 | 0 | AVL | Limit 19:16 | P | DPL | 0 | Type (1 0 B 1) | Base 23:16; low doubleword (bytes 0-3): Base Address 15:00 | Segment Limit 15:00]
The base, limit, and DPL fields and the granularity and present flags have functions similar to their use in data-
segment descriptors (see Section 3.4.5, “Segment Descriptors”). When the G flag is 0 in a TSS descriptor for a 32-
bit TSS, the limit field must have a value equal to or greater than 67H, one byte less than the minimum size of a
TSS. Attempting to switch to a task whose TSS descriptor has a limit less than 67H generates an invalid-TSS excep-
tion (#TS). A larger limit is required if an I/O permission bit map is included or if the operating system stores addi-
tional data. The processor does not check for a limit greater than 67H on a task switch; however, it does check
when accessing the I/O permission bit map or interrupt redirection bit map.
Any program or procedure with access to a TSS descriptor (that is, whose CPL is numerically equal to or less than
the DPL of the TSS descriptor) can dispatch the task with a call or a jump.
In most systems, the DPLs of TSS descriptors are set to values less than 3, so that only privileged software can
perform task switching. However, in multitasking applications, DPLs for some TSS descriptors may be set to 3 to
allow task switching at the application (or user) privilege level.
[Figure: format of a TSS descriptor in 64-bit mode — a 16-byte descriptor; offset 12: reserved (with a must-be-zero field); offset 8: Base Address 63:32; offsets 0-7: same format as the legacy TSS descriptor shown above]
[Figure: Task Register — the task register holds the segment selector for the current task's TSS descriptor in the GDT; the descriptor supplies the base address and limit of the TSS]
[Figure: Task-Gate Descriptor — offset 4: Reserved | P | DPL | 0 | Type (0 0 1 0 1) | Reserved; offset 0: TSS Segment Selector | Reserved]
A task can be accessed either through a task-gate descriptor or a TSS descriptor. Both of these structures satisfy
the following needs:
• Need for a task to have only one busy flag — Because the busy flag for a task is stored in the TSS
descriptor, each task should have only one TSS descriptor. There may, however, be several task gates that
reference the same TSS descriptor.
• Need to provide selective access to tasks — Task gates fill this need, because they can reside in an LDT and
can have a DPL that is different from the TSS descriptor's DPL. A program or procedure that does not have
sufficient privilege to access the TSS descriptor for a task in the GDT (which usually has a DPL of 0) may be
allowed access to the task through a task gate with a higher DPL. Task gates give the operating system greater
latitude for limiting access to specific tasks.
• Need for an interrupt or exception to be handled by an independent task — Task gates may also reside
in the IDT, which allows interrupts and exceptions to be handled by handler tasks. When an interrupt or
exception vector points to a task gate, the processor switches to the specified task.
Figure 7-7 illustrates how a task gate in an LDT, a task gate in the GDT, and a task gate in the IDT can all point to
the same task.
[Figure 7-7: task gates in an LDT, in the GDT, and in the IDT all referencing the same TSS]
1. The INT1 has opcode F1; the INT n instruction with n=1 has opcode CD 01.
15. If CET is enabled, the processor performs the following shadow stack actions:
IF shadow stack enabled at current CPL OR indirect branch tracking at current CPL
THEN
IF EFLAGS.VM = 1
THEN #TSS(New-Task-TSS); FI;
FI;
IF shadow stack enabled at current CPL
IF task switch initiated by CALL instruction, JMP instruction, interrupt or exception (* switch stack *)
new_SSP ← Load the 4 bytes at offset 104 in the TSS
(* Verify the new SSP is 8-byte aligned *)
IF new_SSP & 0x07 != 0
THEN #TSS(New-Task-TSS); FI;
expected_token_value = new_SSP (* busy bit - bit 0 - must be clear *)
new_token_value = new_SSP | BUSY_BIT (* set the busy bit - bit 0 *)
IF shadow_stack_lock_cmpxchg8b(new_SSP, new_token_value, expected_token_value) != expected_token_value
THEN #TSS(New-Task-TSS); FI;
SSP = new_SSP
IF pushCsLipSsp = 1 (* call, int, exception from user → user or supv → supv or supv → user *)
Push tempSsCS, tempSsLip, tempSsSSP on shadow stack using 8B pushes
FI;
FI;
FI;
IF task switch initiated by IRET
IF verifyCsLIP = 1
(* do 64 bit comparisons; CS zero padded to 64 bit; CSBASE+EIP zero padded to 64 bit *)
IF tempSsCS and tempSsLIP do not match CS and CSBASE+EIP
THEN #CP(FAR-RET/IRET); FI;
FI;
IF ShadowStackEnabled(CPL)
THEN
IF verifyCsLIP = 0 THEN tempSSP = IA32_PL3_SSP; FI;
IF tempSSP & 0x03 != 0 THEN #CP(FAR-RET/IRET); FI; (* verify aligned to 4 bytes *)
IF tempSSP[63:32] != 0 THEN #CP(FAR-RET/IRET); FI;
SSP = tempSSP
FI;
FI;
IF EndbranchEnabled(CPL)
IF task switch initiated by CALL instruction, JMP instruction, interrupt or exception
IF CPL = 3
THEN
IA32_U_CET.TRACKER = WAIT_FOR_ENDBRANCH
IA32_U_CET.SUPPRESS = 0
ELSE
IA32_S_CET.TRACKER = WAIT_FOR_ENDBRANCH
IA32_S_CET.SUPPRESS = 0
FI;
FI;
FI;
16. Begins executing the new task. (To an exception handler, the first instruction of the new task appears not to
have been executed.)
NOTES
If all checks and saves have been carried out successfully, the processor commits to the task
switch. If an unrecoverable error occurs in steps 1 through 8, the processor does not complete the
task switch and ensures that the processor is returned to its state prior to the execution of the
instruction that initiated the task switch.
If an unrecoverable error occurs in step 9, architectural state may be corrupted, but an attempt will
be made to handle the error in the prior execution environment. If an unrecoverable error occurs
after the commit point (in step 13), the processor completes the task switch (without performing
additional access and segment availability checks) and generates the appropriate exception prior to
beginning execution of the new task.
If exceptions occur after the commit point, the exception handler must finish the task switch itself
before allowing the processor to begin executing the new task. See Chapter 6, “Interrupt
10—Invalid TSS Exception (#TS),” for more information about the effect of exceptions on a task
when they occur after the commit point of a task switch.
The state of the currently executing task is always saved when a successful task switch occurs. If the task is
resumed, execution starts with the instruction pointed to by the saved EIP value, and the registers are restored to
the values they held when the task was suspended.
When switching tasks, the new task does not inherit its privilege level from the suspended task. The new task begins executing at the privilege level specified in the CPL field of the CS register, which is
loaded from the TSS. Because tasks are isolated by their separate address spaces and TSSs and because privilege
rules control access to a TSS, software does not need to perform explicit privilege checks on a task switch.
Table 7-1 shows the exception conditions that the processor checks for when switching tasks. It also shows the
exception that is generated for each check if an error is detected and the segment that the error code references.
(The order of the checks in the table is the order used in the P6 family processors. The exact order is model specific
and may be different for other IA-32 processors.) Exception handlers designed to handle these exceptions may be
subject to recursive calls if they attempt to reload the segment selector that generated the exception. The cause of
the exception (or the first of multiple causes) should be fixed before reloading the selector.
The TS (task switched) flag in control register CR0 is set every time a task switch occurs. System software uses the TS flag to coordinate the actions of the floating-point unit with the rest of the processor when floating-point exceptions are generated. The TS flag indicates that the context of the floating-point unit may be different from that of the
current task. See Section 2.5, “Control Registers”, for a detailed description of the function and use of the TS flag.
Table 7-2 shows the busy flag (in the TSS segment descriptor), the NT flag, the previous task link field, and TS flag
(in control register CR0) during a task switch.
The NT flag may be modified by software executing at any privilege level. It is therefore possible for a program to set the NT flag and execute an IRET instruction, inadvertently invoking the task specified in the previous task link field of the current task's TSS. To keep such spurious task switches from succeeding, the operating system should initialize the
previous task link field in every TSS that it creates to 0.
Table 7-2. Effect of a Task Switch on Busy Flag, NT Flag, Previous Task Link Field, and TS Flag

Flag or Field                        | Effect of JMP Instruction                 | Effect of CALL Instruction or Interrupt   | Effect of IRET Instruction
Busy (B) flag of new task            | Flag is set. Must have been clear before. | Flag is set. Must have been clear before. | No change. Must have been set.
Busy flag of old task                | Flag is cleared.                          | No change. Flag is currently set.         | Flag is cleared.
NT flag of new task                  | Set to value from TSS of new task.        | Flag is set.                              | Set to value from TSS of new task.
NT flag of old task                  | No change.                                | No change.                                | Flag is cleared.
Previous task link field of new task | No change.                                | Loaded with selector for old task's TSS.  | No change.
Previous task link field of old task | No change.                                | No change.                                | No change.
TS flag in control register CR0      | Flag is set.                              | Flag is set.                              | Flag is set.
Because all tasks have access to the GDT, it also is possible to create shared segments accessed through segment
descriptors in this table.
If paging is enabled, the CR3 register (PDBR) field in the TSS allows each task to have its own set of page tables
for mapping linear addresses to physical addresses. Or, several tasks can share the same set of page tables.
[Figure: Task A and Task B each have their own PDBR (CR3) and private page tables, while a shared page table, referenced from both page directories, maps a region of physical memory common to both tasks]
[Figure: 16-bit Task-State Segment (TSS); fields are 16 bits wide, with byte offsets at right]
Task LDT Selector 42
DS Selector 40
SS Selector 38
CS Selector 36
ES Selector 34
DI 32
SI 30
BP 28
SP 26
BX 24
DX 22
CX 20
AX 18
FLAG Word 16
IP (Entry Point) 14
SS2 12
SP2 10
SS1 8
SP1 6
SS0 4
SP0 2
Previous Task Link 0
[Figure: 64-bit Task-State Segment (TSS); byte offsets at right]
I/O Map Base Address Reserved 100
Reserved 96
Reserved 92
IST7 (upper 32 bits) 88
IST7 (lower 32 bits) 84
IST6 (upper 32 bits) 80
IST6 (lower 32 bits) 76
IST5 (upper 32 bits) 72
IST5 (lower 32 bits) 68
IST4 (upper 32 bits) 64
IST4 (lower 32 bits) 60
IST3 (upper 32 bits) 56
IST3 (lower 32 bits) 52
IST2 (upper 32 bits) 48
IST2 (lower 32 bits) 44
IST1 (upper 32 bits) 40
IST1 (lower 32 bits) 36
Reserved 32
Reserved 28
RSP2 (upper 32 bits) 24
RSP2 (lower 32 bits) 20
RSP1 (upper 32 bits) 16
RSP1 (lower 32 bits) 12
RSP0 (upper 32 bits) 8
RSP0 (lower 32 bits) 4
Reserved 0
14.Updates to Chapter 11, Volume 3A
Change bars and green text show changes to Chapter 11 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.
------------------------------------------------------------------------------------------
Changes to this chapter: Update to Section 11.3 “Methods of Caching Available”, specifically Write Combining
(WC) details.
This chapter describes the memory cache and cache control mechanisms, the TLBs, and the store buffer in Intel 64
and IA-32 processors. It also describes the memory type range registers (MTRRs) introduced in the P6 family
processors and how they are used to control caching of physical memory locations.
[Figure: instruction TLBs, data TLBs, L1 data cache unit, unified L2 cache, and (on some processors) L3 cache, connected to physical memory over the external system bus]
Figure 11-1. Cache Structure of the Pentium 4 and Intel Xeon Processors
Vol. 3A 11-1
MEMORY CACHE CONTROL
[Figure: cache structure of processors based on the Nehalem microarchitecture — a per-core L1 data cache unit and L2 cache feed the out-of-order engine, with an L3 cache shared across cores and QPI links to the rest of the platform]
Intel 64 and IA-32 processors may implement four types of caches: the trace cache, the level 1 (L1) cache, the
level 2 (L2) cache, and the level 3 (L3) cache. See Figure 11-1. Cache availability is described below:
• Intel Core i7, i5, i3 processor Family and Intel Xeon processor Family based on Intel® microarchi-
tecture code name Nehalem and Intel® microarchitecture code name Westmere — The L1 cache is
divided into two sections: one section is dedicated to caching instructions (pre-decoded instructions) and the
other caches data. The L2 cache is a unified data and instruction cache. Each processor core has its own L1 and
L2. The L3 cache is an inclusive, unified data and instruction cache, shared by all processor cores inside a
physical package. No trace cache is implemented.
• Intel® Core™ 2 processor family and Intel® Xeon® processor family based on Intel® Core™ micro-
architecture — The L1 cache is divided into two sections: one section is dedicated to caching instructions (pre-
decoded instructions) and the other caches data. The L2 cache is a unified data and instruction cache located
on the processor chip; it is shared between two processor cores in a dual-core processor implementation.
Quad-core processors have two L2, each shared by two processor cores. No trace cache is implemented.
• Intel® Atom™ processor — The L1 cache is divided into two sections: one section is dedicated to caching
instructions (pre-decoded instructions) and the other caches data. The L2 cache is a unified data and instruction cache located on the processor chip. No trace cache is implemented.
• Intel® Core™ Solo and Intel® Core™ Duo processors — The L1 cache is divided into two sections: one
section is dedicated to caching instructions (pre-decoded instructions) and the other caches data. The L2 cache
is a unified data and instruction cache located on the processor chip. It is shared between two processor cores
in a dual-core processor implementation. No trace cache is implemented.
• Pentium® 4 and Intel® Xeon® processors Based on Intel NetBurst® microarchitecture — The trace
cache caches decoded instructions (μops) from the instruction decoder and the L1 cache contains data. The L2
and L3 caches are unified data and instruction caches located on the processor chip. Dual-core processors have
two L2, one in each processor core. Note that the L3 cache is only implemented on some Intel Xeon processors.
• P6 family processors — The L1 cache is divided into two sections: one dedicated to caching instructions (pre-
decoded instructions) and the other to caching data. The L2 cache is a unified data and instruction cache
located on the processor chip. P6 family processors do not implement a trace cache.
• Pentium® processors — The L1 cache has the same structure as on P6 family processors. There is no trace
cache. The L2 cache is a unified data and instruction cache external to the processor chip on earlier Pentium
processors and implemented on the processor chip in later Pentium processors. For Pentium processors where
the L2 cache is external to the processor, access to the cache is through the system bus.
For Intel Core i7 processors, for processors based on the Intel Core, Intel Atom, and Intel NetBurst microarchitectures,
and for Intel Core Duo, Intel Core Solo, and Pentium M processors, the cache lines for the L1 and L2 caches (and L3 caches
if supported) are 64 bytes wide. The processor always reads a cache line from system memory beginning on a
64-byte boundary. (A 64-byte aligned cache line begins at an address with its 6 least-significant bits clear.) A
cache line can be filled from memory with an 8-transfer burst transaction. The caches do not support partially-filled
cache lines, so caching even a single doubleword requires caching an entire line.
11-4 Vol. 3A
MEMORY CACHE CONTROL
The L1 and L2 cache lines in the P6 family and Pentium processors are 32 bytes wide, with cache line reads from
system memory beginning on a 32-byte boundary (the 5 least-significant bits of a memory address clear). A cache line
can be filled from memory with a 4-transfer burst transaction. Partially-filled cache lines are not supported.
The trace cache in processors based on Intel NetBurst microarchitecture is available in all execution modes:
protected mode, system management mode (SMM), and real-address mode. The L1, L2, and L3 caches are also
available in all execution modes; however, use of them must be handled carefully in SMM (see Section 34.4.2,
“SMRAM Caching”).
The TLBs store the most recently used page-directory and page-table entries. They speed up memory accesses
when paging is enabled by reducing the number of memory accesses that are required to read the page tables
stored in system memory. The TLBs are divided into four groups: instruction TLBs for 4-KByte pages, data TLBs for
4-KByte pages; instruction TLBs for large pages (2-MByte, 4-MByte or 1-GByte pages), and data TLBs for large
pages. The TLBs are normally active only in protected mode with paging enabled. When paging is disabled or the
processor is in real-address mode, the TLBs maintain their contents until explicitly or implicitly flushed (see Section
11.9, “Invalidating the Translation Lookaside Buffers (TLBs)”).
Processors based on Intel Core microarchitectures implement one level of instruction TLB and two levels of data
TLB. The Intel Core i7 processor provides a second-level unified TLB.
The store buffer is associated with the processor’s instruction execution units. It allows writes to system memory
and/or the internal caches to be saved and in some cases combined to optimize the processor’s bus accesses. The
store buffer is always enabled in all execution modes.
The processor’s caches are for the most part transparent to software. When enabled, instructions and data flow
through these caches without the need for explicit software control. However, knowledge of the behavior of these
caches may be useful in optimizing software performance. For example, knowledge of cache dimensions and
replacement algorithms gives an indication of how large a data structure can be operated on at once without
causing cache thrashing.
In multiprocessor systems, maintenance of cache consistency may, in rare circumstances, require intervention by
system software. For these rare cases, the processor provides privileged cache control instructions for use in
flushing caches and forcing memory ordering.
There are several instructions that software can use to improve the performance of the L1, L2, and L3 caches,
including the PREFETCHh, CLFLUSH, and CLFLUSHOPT instructions and the non-temporal move instructions
(MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD). The use of these instructions is discussed in Section
11.5.5, “Cache Management Instructions.”
When operating in an MP system, IA-32 processors (beginning with the Intel486 processor) and Intel 64 processors
have the ability to snoop other processors’ accesses to system memory and to their internal caches. They use this
snooping ability to keep their internal caches consistent both with system memory and with the caches in other
processors on the bus. For example, in the Pentium and P6 family processors, if through snooping one processor
detects that another processor intends to write to a memory location that it currently has cached in shared state,
the snooping processor will invalidate its cache line, forcing it to perform a cache line fill the next time it accesses
the same memory location.
Beginning with the P6 family processors, if a processor detects (through snooping) that another processor is trying
to access a memory location that it has modified in its cache, but has not yet written back to system memory, the
snooping processor will signal the other processor (by means of the HITM# signal) that the cache line is held in
modified state and will perform an implicit write-back of the modified data. The implicit write-back is transferred
directly to the initial requesting processor and snooped by the memory controller to ensure that system memory
has been updated. Here, the processor with the valid data may pass the data to the other processors without actu-
ally writing it to system memory; however, it is the responsibility of the memory controller to snoop this operation
and update memory.
NOTE
The behavior of FP and SSE/SSE2 operations on operands in UC memory is implementation
dependent. In some implementations, accesses to UC memory may occur more than once. To
ensure predictable behavior, use loads and stores of general purpose registers to access UC
memory that may have read or write side effects.
• Uncacheable (UC-) — Has the same characteristics as the strong uncacheable (UC) memory type, except that
this memory type can be overridden by programming the MTRRs for the WC memory type. This memory type
is available in processor families starting from the Pentium III processors and can only be selected through the
PAT.
• Write Combining (WC) — System memory locations are not cached (as with uncacheable memory) and
coherency is not enforced by the processor’s bus coherency protocol. Speculative reads are allowed. Writes
may be delayed and combined in the write combining buffer (WC buffer) to reduce memory accesses. If the WC
buffer is partially filled, the writes may be delayed until the next occurrence of a serializing event, such as an
SFENCE or MFENCE instruction, a CPUID execution, a read or write to uncached memory, an interrupt
occurrence, or an execution of a LOCK instruction (including one with an XACQUIRE or XRELEASE prefix). In
addition, an execution of the XEND instruction (to end a transactional region) evicts any writes that were
buffered before the corresponding execution of the XBEGIN instruction (to begin the transactional region)
before evicting any writes that were performed inside the transactional region.
This type of cache-control is appropriate for video frame buffers, where the order of writes is unimportant as
long as the writes update memory so they can be seen on the graphics display. See Section 11.3.1, “Buffering
of Write Combining Memory Locations,” for more information about caching the WC memory type. This memory
type is available in the Pentium Pro and Pentium II processors by programming the MTRRs; or in processor
families starting from the Pentium III processors by programming the MTRRs or by selecting it through the PAT.
• Write-through (WT) — Writes and reads to and from system memory are cached. Reads come from cache
lines on cache hits; read misses cause cache fills. Speculative reads are allowed. All writes are written to a
cache line (when possible) and through to system memory. When writing through to memory, invalid cache
lines are never filled, and valid cache lines are either filled or invalidated. Write combining is allowed. This type
of cache-control is appropriate for frame buffers or when there are devices on the system bus that access
system memory, but do not perform snooping of memory accesses. It enforces coherency between caches in
the processors and system memory.
• Write-back (WB) — Writes and reads to and from system memory are cached. Reads come from cache lines
on cache hits; read misses cause cache fills. Speculative reads are allowed. Write misses cause cache line fills
(in processor families starting with the P6 family processors), and writes are performed entirely in the cache,
when possible. Write combining is allowed. The write-back memory type reduces bus traffic by eliminating
many unnecessary writes to system memory. Writes to a cache line are not immediately forwarded to system
memory; instead, they are accumulated in the cache. The modified cache lines are written to system memory
later, when a write-back operation is performed. Write-back operations are triggered when cache lines need to
be deallocated, such as when new cache lines are being allocated in a cache that is already full. They also are
triggered by the mechanisms used to maintain cache consistency. This type of cache-control provides the best
performance, but it requires that all devices that access system memory on the system bus be able to snoop
memory accesses to ensure system memory and cache coherency.
• Write protected (WP) — Reads come from cache lines when possible, and read misses cause cache fills.
Writes are propagated to the system bus and cause corresponding cache lines on all processors on the bus to
be invalidated. Speculative reads are allowed. This memory type is available in processor families starting from
the P6 family processors by programming the MTRRs (see Table 11-6).
Table 11-3 shows which of these caching methods are available in the Pentium, P6 Family, Pentium 4, and Intel
Xeon processors.
Table 11-3. Methods of Caching Available in Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4,
Intel Xeon, P6 Family, and Pentium Processors
Memory Type | Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4 and Intel Xeon Processors | P6 Family Processors | Pentium Processor
Strong Uncacheable (UC) | Yes | Yes | Yes
Uncacheable (UC-) | Yes | Yes* | No
Write Combining (WC) | Yes | Yes | No
Write Through (WT) | Yes | Yes | Yes
Write Back (WB) | Yes | Yes | Yes
Write Protected (WP) | Yes | Yes | No
NOTE:
* Introduced in the Pentium III processor; not available in the Pentium Pro or Pentium II processors
lines in the processor's L2 and L3 caches and L1 data cache. Therefore, systems should use write-combining
memory for frame buffers whenever possible.
Software can use page-level cache control to assign appropriate effective memory types when software will not
access data structures in ways that benefit from write-back caching. For example, software may read a large data
structure once and not access the structure again until the structure is rewritten by another agent. Such a large
data structure should be marked as uncacheable, or reading it will evict cached lines that the processor will be
referencing again.
A similar example would be a write-only data structure that is written to (to export the data to another agent), but
never read by software. Such a structure can be marked as uncacheable, because software never reads the values
that it writes (though as uncacheable memory, it will be written using partial writes, while as write-back memory,
it will be written using line writes, which may not occur until the other agent reads the structure and triggers
implicit write-backs).
On the Pentium III, Pentium 4, and more recent processors, new instructions are provided that give software
greater control over the caching, prefetching, and the write-back characteristics of data. These instructions allow
software to use weakly ordered or processor ordered memory types to improve processor performance, but to force
strong ordering on memory reads and/or writes when necessary. They also allow software greater control over
the caching of data. For a description of these instructions and their intended use, see Section 11.5.5, “Cache
Management Instructions.”
The L1 instruction cache in P6 family processors implements only the “SI” part of the MESI protocol, because the
instruction cache is not writable. The instruction cache monitors changes in the data cache to maintain consistency
between the caches when instructions are modified. See Section 11.6, “Self-Modifying Code,” for more information
on the implications of caching instructions.
[Figure: CR0 (CD, NW), CR3 (PCD, PWT — control caching of the page directory), CR4 (PGE — enables global pages designated with the G flag), page-directory and page-table entries (PCD, PWT, G, PAT — the PAT controls caching of virtual memory pages), the store buffer, and the TLBs, shown controlling caching of physical memory (0–FFFFFFFFH).]
Figure 11-3. Cache-Control Registers and Bits Available in Intel 64 and IA-32 Processors
• Shared lines remain shared after write hit. Yes Yes
• Write misses access memory. Yes Yes
• Invalidation is inhibited when snooping, but is allowed with INVD and WBINVD instructions. Yes Yes
• External snoop traffic is supported. No Yes
NOTES:
1. The L2/L3 column in this table is definitive for the Pentium 4, Intel Xeon, and P6 family processors. It is intended to represent what
could be implemented in a system based on a Pentium processor with an external, platform-specific, write-back L2 cache.
2. The Pentium 4 and more recent processor families do not support this mode; setting the CD and NW bits to 1 selects the no-fill
cache mode.
3. Not supported in Intel Atom processors. If CD = 1 in an Intel Atom processor, caching is disabled.
• NW flag, bit 29 of control register CR0 — Controls the write policy for system memory locations (see
Section 2.5, “Control Registers”). If the NW and CD flags are clear, write-back is enabled for the whole of
system memory, but may be restricted for individual pages or regions of memory by other cache-control
mechanisms. Table 11-5 shows how the other combinations of CD and NW flags affect caching.
NOTES
For the Pentium 4 and Intel Xeon processors, the NW flag is a don’t care flag; that is, when the CD
flag is set, the processor uses the no-fill cache mode, regardless of the setting of the NW flag.
For Intel Atom processors, the NW flag is a don’t care flag; that is, when the CD flag is set, the
processor disables caching, regardless of the setting of the NW flag.
For the Pentium processor, when the L1 cache is disabled (the CD and NW flags in control register
CR0 are set), external snoops are accepted in DP (dual-processor) systems and inhibited in unipro-
cessor systems.
When snoops are inhibited, address parity is not checked and APCHK# is not asserted for a corrupt
address; however, when snoops are accepted, address parity is checked and APCHK# is asserted
for corrupt addresses.
• PCD and PWT flags in paging-structure entries — Control the memory type used to access paging
structures and pages (see Section 4.9, “Paging and Memory Typing”).
• PCD and PWT flags in control register CR3 — Control the memory type used to access the first paging
structure of the current paging-structure hierarchy (see Section 4.9, “Paging and Memory Typing”).
• G (global) flag in the page-directory and page-table entries (introduced to the IA-32 architecture in
the P6 family processors) — Controls the flushing of TLB entries for individual pages. See Section 4.10,
“Caching Translation Information,” for more information about this flag.
• PGE (page global enable) flag in control register CR4 — Enables the establishment of global pages with
the G flag. See Section 4.10, “Caching Translation Information,” for more information about this flag.
• Memory type range registers (MTRRs) (introduced in P6 family processors) — Control the type of
caching used in specific regions of physical memory. Any of the caching types described in Section 11.3,
“Methods of Caching Available,” can be selected. See Section 11.11, “Memory Type Range Registers (MTRRs),”
for a detailed description of the MTRRs.
• Page Attribute Table (PAT) MSR (introduced in the Pentium III processor) — Extends the memory
typing capabilities of the processor to permit memory types to be assigned on a page-by-page basis (see
Section 11.12, “Page Attribute Table (PAT)”).
• Third-Level Cache Disable flag, bit 6 of the IA32_MISC_ENABLE MSR (Available only in processors
based on Intel NetBurst microarchitecture) — Allows the L3 cache to be disabled and enabled, indepen-
dently of the L1 and L2 caches.
• KEN# and WB/WT# pins (Pentium processor) — Allow external hardware to control the caching method
used for specific areas of memory. They perform similar (but not identical) functions to the MTRRs in the P6
family processors.
• PCD and PWT pins (Pentium processor) — These pins (which are associated with the PCD and PWT flags in
control register CR3 and in the page-directory and page-table entries) permit caching in an external L2 cache
to be controlled on a page-by-page basis, consistent with the control exercised on the L1 cache of these
processors. The P6 and more recent processor families do not provide these pins because the L2 cache is
internal to the chip package.
true; that is, if a page-level caching control designates a page as uncacheable, an MTRR cannot be used to make
the page cacheable.
In cases where there is an overlap in the assignment of the write-back and write-through caching policies to a page
and a region of memory, the write-through policy takes precedence. The write-combining policy (which can only be
assigned through an MTRR or the PAT) takes precedence over either write-through or write-back.
The selection of memory types at the page level varies depending on whether PAT is being used to select memory
types for pages, as described in the following sections.
On processors based on Intel NetBurst microarchitecture, the third-level cache can be disabled by bit 6 of the
IA32_MISC_ENABLE MSR. Using IA32_MISC_ENABLE[bit 6] takes precedence over the CD flag, MTRRs, and PAT
for the L3 cache in those processors. That is, when the third-level cache disable flag is set (cache disabled), the
other cache controls have no effect on the L3 cache; when the flag is clear (enabled), the cache controls have the
same effect on the L3 cache as they have on the L1 and L2 caches.
IA32_MISC_ENABLE[bit 6] is not supported in Intel Core i7 processors, nor in processors based on the Intel Core
and Intel Atom microarchitectures.
11.5.2.1 Selecting Memory Types for Pentium Pro and Pentium II Processors
The Pentium Pro and Pentium II processors do not support the PAT. Here, the effective memory type for a page is
selected with the MTRRs and the PCD and PWT bits in the page-table or page-directory entry for the page. Table
11-6 describes the mapping of MTRR memory types and page-level caching attributes to effective memory types,
when normal caching is in effect (the CD and NW flags in control register CR0 are clear). Combinations that appear
in gray are implementation-defined for the Pentium Pro and Pentium II processors. System designers are encour-
aged to avoid these implementation-defined combinations.
Table 11-6. Effective Page-Level Memory Type for Pentium Pro and Pentium II Processors
MTRR Memory Type1 | PCD Value | PWT Value | Effective Memory Type
UC | X | X | UC
WC | 0 | 0 | WC
WC | 0 | 1 | WC
WC | 1 | 0 | WC
WC | 1 | 1 | UC
WT | 0 | X | WT
WT | 1 | X | UC
WP | 0 | 0 | WP
WP | 0 | 1 | WP
WP | 1 | 0 | WC
WP | 1 | 1 | UC
WB | 0 | 0 | WB
WB | 0 | 1 | WT
WB | 1 | X | UC
NOTE:
1. These effective memory types also apply to the Pentium 4, Intel Xeon, and Pentium III processors when the PAT bit is not used
(set to 0) in page-table and page-directory entries.
When normal caching is in effect, the effective memory type shown in Table 11-6 is determined using the following
rules:
1. If the PCD and PWT attributes for the page are both 0, then the effective memory type is identical to the
MTRR-defined memory type.
2. If the PCD flag is set, then the effective memory type is UC.
3. If the PCD flag is clear and the PWT flag is set, the effective memory type is WT for the WB memory type and
the MTRR-defined memory type for all other memory types.
4. Setting the PCD and PWT flags to opposite values is considered model-specific for the WP and WC memory
types and architecturally-defined for the WB, WT, and UC memory types.
11.5.2.2 Selecting Memory Types for Pentium III and More Recent Processor Families
The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Intel Core Solo, Pentium M, Pentium 4, Intel Xeon, and Pentium
III processors use the PAT to select effective page-level memory types. Here, a memory type for a page is selected
by the MTRRs and the value in a PAT entry that is selected with the PAT, PCD and PWT bits in a page-table or page-
directory entry (see Section 11.12.3, “Selecting a Memory Type from the PAT”). Table 11-7 describes the mapping
of MTRR memory types and PAT entry types to effective memory types, when normal caching is in effect (the CD
and NW flags in control register CR0 are clear).
Table 11-7. Effective Page-Level Memory Types for Pentium III and More Recent Processor Families
MTRR Memory Type | PAT Entry Value | Effective Memory Type
UC | UC | UC1
UC | UC- | UC1
UC | WC | WC
UC | WT | UC1
UC | WB | UC1
UC | WP | UC1
WC | UC | UC2
WC | UC- | WC
WC | WC | WC
WC | WT | UC2,3
WC | WB | WC
WC | WP | UC2,3
WT | UC | UC2
WT | UC- | UC2
WT | WC | WC
WT | WT | WT
WT | WB | WT
WT | WP | WP3
WB | UC | UC2
WB | UC- | UC2
WB | WC | WC
WB | WT | WT
WB | WB | WB
WB | WP | WP
Table 11-7. Effective Page-Level Memory Types for Pentium III and More Recent Processor Families (Contd.)
MTRR Memory Type | PAT Entry Value | Effective Memory Type
WP | UC | UC2
WP | UC- | WC3
WP | WC | WC
WP | WT | WT3
WP | WB | WP
WP | WP | WP
NOTES:
1. The UC attribute comes from the MTRRs and the processors are not required to snoop their caches since the data could never have
been cached. This attribute is preferred for performance reasons.
2. The UC attribute came from the page-table or page-directory entry and processors are required to check their caches because the
data may be cached due to page aliasing, which is not recommended.
3. These combinations were specified as “undefined” in previous editions of the Intel® 64 and IA-32 Architectures Software Devel-
oper’s Manual. However, all processors that support both the PAT and the MTRRs determine the effective page-level memory
types for these combinations as given.
NOTES
Setting the CD flag in control register CR0 modifies the processor’s caching behavior as indicated
in Table 11-5, but setting the CD flag alone may not be sufficient across all processor families to
force the effective memory type for all physical memory to be UC, nor does it force strict memory
ordering, due to hardware implementation variations across different processor families. To force
the UC memory type and strict memory ordering on all of physical memory, it is sufficient to either
program the MTRRs for all physical memory to be UC memory type or disable all MTRRs.
For the Pentium 4 and Intel Xeon processors, after the sequence of steps given above has been
executed, the cache lines containing the code between the end of the WBINVD instruction and
before the MTRRs have actually been disabled may be retained in the cache hierarchy. Here, to
remove code from the cache completely, a second WBINVD instruction must be executed after the
MTRRs have been disabled.
For Intel Atom processors, setting the CD flag forces all physical memory to observe UC semantics
(without requiring the memory type of physical memory to be set explicitly). Consequently, software
does not need to issue a second WBINVD as some other processor generations might require.
Because of speculative execution in the P6 and more recent processor families, the last MOV instruction performed
would place the value at physical location B000H into EBX, rather than the value at the new physical address
A000H. This situation is remedied by placing a TLB invalidation between the load and the store.
• (Pentium 4, Intel Xeon, and later processors only.) Writing to control register CR4 to modify the PSE, PGE, or
PAE flag.
• Writing to control register CR4 to change the PCIDE flag from 1 to 0.
See Section 4.10, “Caching Translation Information,” for additional information about the TLBs.
NOTE
In multiple processor systems, the operating system must maintain MTRR consistency between all
the processors in the system (that is, all processors must use the same MTRR values). The P6 and
more recent processor families provide no hardware support for maintaining this consistency.
[Figure: Mapping of physical memory with MTRRs — variable ranges (from 4 KBytes up to the maximum size of physical memory) above 100000H, up to FFFFFFFFH; 64 fixed ranges of 4 KBytes each covering C0000H–FFFFFH (256 KBytes); 16 fixed ranges of 16 KBytes each covering 80000H–BFFFFH (256 KBytes); 8 fixed ranges of 64 KBytes each covering 0–7FFFFH (512 KBytes).]
• VCNT (variable range registers count) field, bits 0 through 7 — Indicates the number of variable ranges
implemented on the processor.
• FIX (fixed range registers supported) flag, bit 8 — Fixed range MTRRs (IA32_MTRR_FIX64K_00000
through IA32_MTRR_FIX4K_0F8000) are supported when set; no fixed range registers are supported when
clear.
• WC (write combining) flag, bit 10 — The write-combining (WC) memory type is supported when set; the
WC type is not supported when clear.
• SMRR (System-Management Range Register) flag, bit 11 — The system-management range register
(SMRR) interface is supported when bit 11 is set; the SMRR interface is not supported when clear.
Bit 9 and bits 12 through 63 in the IA32_MTRRCAP MSR are reserved. If software attempts to write to the
IA32_MTRRCAP MSR, a general-protection exception (#GP) is generated.
Software must read the IA32_MTRRCAP VCNT field to determine the number of variable MTRRs and query other
feature bits in IA32_MTRRCAP to determine additional capabilities that are supported in a processor. For example,
some processors may report a value of ‘8’ in the VCNT field, while other processors may report different VCNT
values.
IA32_MTRRCAP register layout: VCNT in bits 7:0; FIX flag in bit 8; WC flag in bit 10; SMRR flag in bit 11; bit 9 and bits 63:12 reserved.
IA32_MTRR_DEF_TYPE register layout:
Type (bits 7:0) — Default memory type
FE (bit 10) — Fixed-range MTRRs enable/disable
E (bit 11) — MTRR enable/disable
All other bits — Reserved
• FE (fixed MTRRs enabled) flag, bit 10 — Fixed-range MTRRs are enabled when set; fixed-range MTRRs are
disabled when clear. When the fixed-range MTRRs are enabled, they take priority over the variable-range
MTRRs when overlaps in ranges occur. If the fixed-range MTRRs are disabled, the variable-range MTRRs can
still be used and can map the range ordinarily covered by the fixed-range MTRRs.
• E (MTRRs enabled) flag, bit 11 — MTRRs are enabled when set; all MTRRs are disabled when clear, and the
UC memory type is applied to all of physical memory. When this flag is set, the FE flag can disable the fixed-
range MTRRs; when the flag is clear, the FE flag has no effect. When the E flag is set, the type specified in the
default memory type field is used for areas of memory not already mapped by either a fixed or variable MTRR.
Bits 8 and 9, and bits 12 through 63, in the IA32_MTRR_DEF_TYPE MSR are reserved; the processor generates a
general-protection exception (#GP) if software attempts to write nonzero values to them.
Figure 11-7 shows flags and fields in these registers. The functions of these flags and fields are:
• Type field, bits 0 through 7 — Specifies the memory type for the range (see Table 11-8 for the encoding of
this field).
• PhysBase field, bits 12 through (MAXPHYADDR-1) — Specifies the base address of the address range.
This 24-bit value, in the case where MAXPHYADDR is 36 bits, is extended by 12 bits at the low end to form the
base address (this automatically aligns the address on a 4-KByte boundary).
• PhysMask field, bits 12 through (MAXPHYADDR-1) — Specifies a mask (24 bits if the maximum physical
address size is 36 bits, 28 bits if the maximum physical address size is 40 bits). The mask determines the range
of the region being mapped, according to the following relationships:
— Address_Within_Range AND PhysMask = PhysBase AND PhysMask
— This value is extended by 12 bits at the low end to form the mask value. For more information, see Section
11.11.3, “Example Base and Mask Calculations.”
— The width of the PhysMask field depends on the maximum physical address size supported by the
processor.
CPUID.80000008H reports the maximum physical address size supported by the processor. If
CPUID.80000008H is not available, software may assume that the processor supports a 36-bit physical
address size (then PhysMask is 24 bits wide and the upper 28 bits of IA32_MTRR_PHYSMASKn are
reserved). See the Note below.
• V (valid) flag, bit 11 — Enables the register pair when set; disables register pair when clear.
IA32_MTRR_PHYSBASEn register layout: Type in bits 7:0; PhysBase in bits (MAXPHYADDR-1):12.
IA32_MTRR_PHYSMASKn register layout: V flag in bit 11; PhysMask in bits (MAXPHYADDR-1):12.
All other bits in the IA32_MTRR_PHYSBASEn and IA32_MTRR_PHYSMASKn registers are reserved; the processor
generates a general-protection exception (#GP) if software attempts to write to them.
Some mask values can result in ranges that are not continuous. In such ranges, the area not mapped by the mask
value is set to the default memory type, unless some other MTRR specifies a type for that range. Intel does not
encourage the use of “discontinuous” ranges.
NOTE
It is possible for software to parse the memory descriptions that BIOS provides by using the
ACPI/INT15 e820 interface mechanism. This information then can be used to determine how
MTRRs are initialized (for example: allowing the BIOS to define valid memory ranges and the
maximum memory range supported by the platform, including the processor).
See Section 11.11.4.1, “MTRR Precedences,” for information on overlapping variable MTRR ranges.
1. For some processor models, these MSRs can be accessed by RDMSR and WRMSR only if the SMRR interface has been enabled using
a model-specific bit in the IA32_FEATURE_CONTROL MSR.
• Type field, bits 0 through 7 — Specifies the memory type for the range (see Table 11-8 for the encoding of
this field).
• PhysBase field, bits 12 through 31 — Specifies the base address of the address range. The address must be
less than 4 GBytes and is automatically aligned on a 4-KByte boundary.
• PhysMask field, bits 12 through 31 — Specifies a mask that determines the range of the region being
mapped, according to the following relationships:
— Address_Within_Range AND PhysMask = PhysBase AND PhysMask
— This value is extended by 12 bits at the low end to form the mask value. For more information, see Section
11.11.3, “Example Base and Mask Calculations.”
• V (valid) flag, bit 11 — Enables the register pair when set; disables register pair when clear.
Before attempting to access these SMRR registers, software must test bit 11 in the IA32_MTRRCAP register. If
SMRR is not supported, reads from or writes to these registers cause general-protection exceptions.
When the valid flag in the IA32_SMRR_PHYSMASK MSR is 1, accesses to the specified address range are treated as
follows:
• If the logical processor is in SMM, accesses use the memory type in the IA32_SMRR_PHYSBASE MSR.
• If the logical processor is not in SMM, write accesses are ignored and read accesses return a fixed value for each
byte. The uncacheable memory type (UC) is used in this case.
The above items apply even if the address range specified overlaps with a range specified by the MTRRs.
Figure: IA32_SMRR_PHYSBASE register — Type (bits 7:0), PhysBase (bits 31:12); IA32_SMRR_PHYSMASK register — V (bit 11), PhysMask (bits 31:12); all other bits reserved.
To map the address range from 400000H to 7FFFFFH (4 MBytes to 8 MBytes), a base value of 000400H is entered
in the PhysBase field and a mask value of FFFC00H is entered in the PhysMask field.
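The worked example above can be checked with a small C sketch (the helper names are illustrative, not part of any Intel interface). The 24-bit PhysBase and PhysMask fields hold bits 35:12 of the base and mask; the low 12 bits are implicitly zero, so the full values are obtained by shifting left by 12, and range membership follows the relationship Address AND PhysMask = PhysBase AND PhysMask:

```c
#include <stdint.h>

/* Extend a PhysBase/PhysMask field value by 12 bits at the low end. */
static uint64_t mtrr_field_to_value(uint64_t field)
{
    return field << 12;
}

/* An address lies inside the variable range when
 *   Address AND PhysMask == PhysBase AND PhysMask   */
static int mtrr_range_matches(uint64_t addr, uint64_t base, uint64_t mask)
{
    return (addr & mask) == (base & mask);
}
```

With the example values, `mtrr_field_to_value(0x000400)` yields the 400000H base and `mtrr_field_to_value(0xFFFC00)` the FFFC00000H mask; every address from 400000H through 7FFFFFH then matches, while 800000H does not.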
11.11.3.1 Base and Mask Calculations for Greater-Than 36-bit Physical Address Support
For Intel 64 and IA-32 processors that support greater than 36 bits of physical address size, software should query
CPUID.80000008H to determine the maximum physical address. See the example.
Example 11-3. Setting-Up Memory for a System with a 40-Bit Address Size
If a processor supports 40 bits of physical address size, then the PhysMask field (in IA32_MTRR_PHYSMASKn
registers) is 28 bits wide instead of 24 bits. For this situation, Example 11-2 should be modified as follows:
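As an illustrative sketch only (the helper name is mine, not from the manual), the mask for a power-of-two-sized range can be computed for any MAXPHYADDR reported by CPUID; on a 40-bit processor the resulting PhysMask field (the mask shifted right by 12) is 28 bits wide rather than 24:

```c
#include <stdint.h>

/* Compute the PhysMask value for a power-of-two-sized range:
 * set every bit from MAXPHYADDR-1 down to the first size bit. */
static uint64_t mtrr_mask(uint64_t range_size, unsigned maxphyaddr)
{
    return ((1ULL << maxphyaddr) - 1) & ~(range_size - 1);
}
```

For a 4-MByte range, a 36-bit processor yields mask FFFC00000H while a 40-bit processor yields FFFFC00000H.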
If the processor does not support MTRRs, the function returns UNSUPPORTED. If the MTRRs are not enabled, then
the UC memory type is returned. If more than one memory type corresponds to the specified range, a status of
MIXED_TYPES is returned. Otherwise, the memory type defined for the range (UC, WC, WT, WB, or WP) is
returned.
The pseudocode for the Get4KMemType() function in Example 11-5 obtains the memory type for a single 4-KByte
range at a given physical address. The sample code determines whether a PHY_ADDRESS falls within a fixed
range by comparing the address with the known fixed ranges: 0 to 7FFFFH (64-KByte regions), 80000H to BFFFFH
(16-KByte regions), and C0000H to FFFFFH (4-KByte regions). If an address falls within one of these ranges, the
appropriate bits within one of its MTRRs determine the memory type.
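The fixed-range test described above can be sketched in C (type and function names are illustrative, not from the manual):

```c
#include <stdint.h>

/* Classify a physical address into the fixed-range MTRR regions. */
typedef enum { FIXED_64K, FIXED_16K, FIXED_4K, NOT_FIXED } fixed_range_t;

static fixed_range_t classify_fixed_range(uint32_t pa)
{
    if (pa <= 0x7FFFFu)                   return FIXED_64K; /* 0 to 7FFFFH: 64-KByte regions */
    if (pa >= 0x80000u && pa <= 0xBFFFFu) return FIXED_16K; /* 80000H to BFFFFH: 16-KByte regions */
    if (pa >= 0xC0000u && pa <= 0xFFFFFu) return FIXED_4K;  /* C0000H to FFFFFH: 4-KByte regions */
    return NOT_FIXED;                                       /* variable-range MTRRs apply */
}
```

Addresses at or above 100000H fall outside all fixed ranges and are governed by the variable-range MTRRs.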
pre_mtrr_change()
BEGIN
disable interrupts;
Save current value of CR4;
disable and flush caches;
flush TLBs;
disable MTRRs;
IF multiprocessing
THEN maintain consistency through IPIs;
FI;
END
post_mtrr_change()
BEGIN
flush caches and TLBs;
enable MTRRs;
enable caches;
restore value of CR4;
enable interrupts;
END
The physical-address-to-variable-range mapping algorithm in the MemTypeSet function detects conflicts with
current variable range registers by cycling through them and determining whether the physical address in question
matches any of the current ranges. During this scan, the algorithm can detect whether any current variable ranges
overlap and could be concatenated into a single range.
The pre_mtrr_change() function disables interrupts prior to changing the MTRRs, to avoid executing code with a
partially valid MTRR setup. The algorithm disables caching by setting the CD flag and clearing the NW flag in control
register CR0. The caches are invalidated using the WBINVD instruction. The algorithm flushes all TLB entries either
by clearing the page-global enable (PGE) flag in control register CR4 (if PGE was already set) or by updating control
register CR3 (if PGE was already clear). Finally, it disables MTRRs by clearing the E flag in the
IA32_MTRR_DEF_TYPE MSR.
After the memory type is updated, the post_mtrr_change() function re-enables the MTRRs and again invalidates
the caches and TLBs. This second invalidation is required because of the processor's aggressive prefetching of both
instructions and data. The algorithm restores interrupts and re-enables caching by clearing the CD flag.
An operating system can batch multiple MTRR updates so that only a single pair of cache invalidations occurs.
7. If the PGE flag is clear in control register CR4, flush all TLBs by executing a MOV from control register CR3 to
another register and then a MOV from that register back to CR3.
8. Disable all range registers (by clearing the E flag in register MTRRdefType). If only variable ranges are being
modified, software may clear the valid bits for the affected register pairs instead.
9. Update the MTRRs.
10. Enable all range registers (by setting the E flag in register MTRRdefType). If only variable-range registers were
modified and their individual valid bits were cleared, then set the valid bits for the affected ranges instead.
11. Flush all caches and all TLBs a second time. (The TLB flush is required for Pentium 4, Intel Xeon, and P6 family
processors. Executing the WBINVD instruction is not needed when using Pentium 4, Intel Xeon, and P6 family
processors, but it may be needed in future systems.)
12. Enter the normal cache mode to re-enable caching. (Clear the CD and NW flags in control register CR0.)
13. Set the PGE flag in control register CR4, if it was cleared in Step 6 (above).
14. Wait for all processors to reach this point.
15. Enable interrupts.
Figure: IA32_PAT MSR — PA0 (bits 2:0), PA1 (bits 10:8), PA2 (bits 18:16), PA3 (bits 26:24), PA4 (bits 34:32), PA5 (bits 42:40), PA6 (bits 50:48), PA7 (bits 58:56); all other bits reserved.
Note that for the P6 family processors, the IA32_PAT MSR is named the PAT MSR.
Table 11-11. Selection of PAT Entries with PAT, PCD, and PWT Flags
PAT PCD PWT PAT Entry
0 0 0 PAT0
0 0 1 PAT1
0 1 0 PAT2
0 1 1 PAT3
1 0 0 PAT4
1 0 1 PAT5
1 1 0 PAT6
1 1 1 PAT7
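The selection in Table 11-11 is simply the 3-bit index formed from the PAT, PCD, and PWT page-table flags, which can be sketched as (helper name is illustrative):

```c
/* PAT entry index = PAT:PCD:PWT, exactly as in Table 11-11. */
static unsigned pat_entry_index(unsigned pat, unsigned pcd, unsigned pwt)
{
    return ((pat & 1u) << 2) | ((pcd & 1u) << 1) | (pwt & 1u);
}
```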
Table 11-12. Memory Type Setting of PAT Entries Following a Power-up or Reset
PAT Entry Memory Type Following Power-up or Reset
PAT0 WB
PAT1 WT
PAT2 UC-
PAT3 UC
PAT4 WB
PAT5 WT
PAT6 UC-
PAT7 UC
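The power-up defaults in Table 11-12 correspond to the architectural IA32_PAT reset value 0007040600070406H. The following sketch (enum and helper names are mine) assembles an IA32_PAT value from eight memory-type encodings, one per byte with PA0 in the low byte:

```c
#include <stdint.h>

/* Memory-type encodings used in the PAT. */
enum { MT_UC = 0x00, MT_WC = 0x01, MT_WT = 0x04,
       MT_WP = 0x05, MT_WB = 0x06, MT_UC_MINUS = 0x07 };

/* Assemble an IA32_PAT value; entry[0] is PA0 (low byte). */
static uint64_t pat_msr_value(const uint8_t entry[8])
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)entry[i] << (8 * i);
    return v;
}
```

With the Table 11-12 defaults {WB, WT, UC-, UC, WB, WT, UC-, UC} this reproduces 0007040600070406H.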
The values in all the entries of the PAT can be changed by writing to the IA32_PAT MSR using the WRMSR
instruction. The IA32_PAT MSR is readable and writable (using the RDMSR and WRMSR instructions, respectively)
by software operating at a CPL of 0. Table 11-10 shows the allowable encodings of the entries in the PAT.
Attempting to write an undefined memory type encoding into the PAT causes a general-protection (#GP) exception
to be generated.
The operating system is responsible for ensuring that changes to a PAT entry occur in a manner that maintains the
consistency of the processor caches and translation lookaside buffers (TLBs). This is accomplished by following the
procedure specified in Section 11.11.8, “MTRR Considerations in MP Systems,” for changing the value of an
MTRR in a multiple-processor system. It requires a specific sequence of operations that includes flushing the
processor's caches and TLBs.
The PAT allows any memory type to be specified in the page tables, and therefore it is possible to have a single
physical page mapped to two or more different linear addresses, each with different memory types. Intel does not
support this practice because it may lead to undefined operations that can result in a system failure. In particular,
a WC page must never be aliased to a cacheable page because WC writes may not check the processor caches.
When remapping a page that was previously mapped as a cacheable memory type to a WC page, an operating
system can avoid this type of aliasing by doing the following:
1. Remove the previous mapping to a cacheable memory type in the page tables; that is, make them not
present.
2. Flush the TLBs of processors that may have used the mapping, even speculatively.
3. Create a new mapping to the same physical address with a new memory type, for instance, WC.
4. Flush the caches on all processors that may have used the mapping previously. Note that on processors that
support self-snooping (CPUID feature flag bit 27), this step is unnecessary.
Operating systems that use a page directory as a page table (to map large pages) and enable page size extensions
must carefully scrutinize the use of the PAT index bit for the 4-KByte page-table entries. The PAT index bit for a
page-table entry (bit 7) corresponds to the page size bit in a page-directory entry. Therefore, the operating system
can only use PAT entries PA0 through PA3 when setting the caching type for a page table that is also used as a page
directory. If the operating system attempts to use PAT entries PA4 through PA7 when using this memory as a page
table, it effectively sets the PS bit for the access to this memory as a page directory.
For compatibility with earlier IA-32 processors that do not support the PAT, care should be taken in selecting the
encodings for entries in the PAT (see Section 11.12.5, “PAT Compatibility with Earlier IA-32 Processors”).
15. Updates to Chapter 15, Volume 3B
Change bars and green text show changes to Chapter 15 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3B: System Programming Guide, Part 2.
------------------------------------------------------------------------------------------
Changes to this chapter: Update to Section 15.8 “Machine-Check Initialization”; removed unclear sentence at the
end of the section.
This chapter describes the machine-check architecture and machine-check exception mechanism found in the
Pentium 4, Intel Xeon, Intel Atom, and P6 family processors. See Chapter 6, “Interrupt 18—Machine-Check Excep-
tion (#MC),” for more information on machine-check exceptions. A brief description of the Pentium processor’s
machine check capability is also given.
Additionally, a signaling mechanism for software to respond to hardware-corrected machine check errors is covered.
Vol. 3B 15-1
MACHINE-CHECK ARCHITECTURE
Figure: Machine-check MSRs — global registers IA32_MCG_STATUS, IA32_MCG_CTL, and IA32_MCG_EXT_CTL; error-reporting bank registers IA32_MCi_STATUS, IA32_MCi_ADDR, IA32_MCi_MISC, and IA32_MCi_CTL2 (each 64 bits wide).
Each error-reporting bank is associated with a specific hardware unit (or group of hardware units) in the processor.
Use RDMSR and WRMSR to read and to write these registers.
Figure: IA32_MCG_CAP register — Count (bits 7:0), MCG_CTL_P (bit 8), MCG_EXT_P (bit 9), MCG_CMCI_P (bit 10), MCG_TES_P (bit 11), MCG_EXT_CNT (bits 23:16), MCG_SER_P (bit 24), MCG_EMC_P (bit 25), MCG_ELOG_P (bit 26), MCG_LMCE_P (bit 27); all other bits reserved.
Where:
• Count field, bits 7:0 — Indicates the number of hardware unit error-reporting banks available in a particular
processor implementation.
• MCG_CTL_P (control MSR present) flag, bit 8 — Indicates that the processor implements the
IA32_MCG_CTL MSR when set; this register is absent when clear.
• MCG_EXT_P (extended MSRs present) flag, bit 9 — Indicates that the processor implements the extended
machine-check state registers found starting at MSR address 180H; these registers are absent when clear.
• MCG_CMCI_P (corrected MC error counting/signaling extension present) flag, bit 10 — Indicates
(when set) that the extended state and associated MSRs necessary to support the reporting of an interrupt on a
corrected MC error event and/or a count threshold of corrected MC errors are present. When this bit is set, it does
not imply that this feature is supported across all banks. Software should check the availability of the necessary
logic on a bank-by-bank basis when using this signaling capability (i.e., whether bit 30 is settable in the individual
IA32_MCi_CTL2 register).
• MCG_TES_P (threshold-based error status present) flag, bit 11 — Indicates (when set) that bits 56:53
of the IA32_MCi_STATUS MSR are part of the architectural space. Bits 56:55 are reserved, and bits 54:53 are
used to report threshold-based error status. Note that when MCG_TES_P is not set, bits 56:53 of the
IA32_MCi_STATUS MSR are model-specific.
• MCG_EXT_CNT, bits 23:16 — Indicates the number of extended machine-check state registers present. This
field is meaningful only when the MCG_EXT_P flag is set.
• MCG_SER_P (software error recovery support present) flag, bit 24 — Indicates (when set) that the
processor supports software error recovery (see Section 15.6), and IA32_MCi_STATUS MSR bits 56:55 are
used to report the signaling of uncorrected recoverable errors and whether software must take recovery
actions for uncorrected errors. Note that when MCG_TES_P is not set, bits 56:53 of the IA32_MCi_STATUS MSR
are model-specific. If MCG_TES_P is set but MCG_SER_P is not set, bits 56:55 are reserved.
• MCG_EMC_P (Enhanced Machine Check Capability) flag, bit 25 — Indicates (when set) that the
processor supports enhanced machine check capabilities for firmware first signaling.
• MCG_ELOG_P (extended error logging) flag, bit 26 — Indicates (when set) that the processor allows
platform firmware to be invoked when an error is detected so that it may provide additional platform specific
information in an ACPI format “Generic Error Data Entry” that augments the data included in machine check
bank registers.
For additional information about the extended error logging interface, see
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/enhanced-mca-logging-xeon-paper.pdf.
• MCG_LMCE_P (local machine check exception) flag, bit 27 — Indicates (when set) that the following
interfaces are present:
— an extended state LMCE_S (located in bit 3 of IA32_MCG_STATUS), and
— the IA32_MCG_EXT_CTL MSR, necessary to support Local Machine Check Exception (LMCE).
A non-zero MCG_LMCE_P indicates that, when LMCE is enabled as described in Section 15.3.1.5, some machine
check errors may be delivered to only a single logical processor.
The effect of writing to the IA32_MCG_CAP MSR is undefined.
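The IA32_MCG_CAP fields described above can be decoded from a raw 64-bit value with a sketch like the following (struct and field names are mine, not an Intel interface; in a real handler the value would come from RDMSR):

```c
#include <stdint.h>

/* Decoded IA32_MCG_CAP fields. */
struct mcg_cap {
    unsigned count;    /* bits 7:0  - number of error-reporting banks   */
    unsigned ext_cnt;  /* bits 23:16 - number of extended state MSRs    */
    int ctl_p, ext_p, cmci_p, tes_p, ser_p, emc_p, elog_p, lmce_p;
};

static struct mcg_cap decode_mcg_cap(uint64_t v)
{
    struct mcg_cap c;
    c.count   = (unsigned)(v & 0xFF);
    c.ctl_p   = (int)((v >> 8)  & 1);
    c.ext_p   = (int)((v >> 9)  & 1);
    c.cmci_p  = (int)((v >> 10) & 1);
    c.tes_p   = (int)((v >> 11) & 1);
    c.ext_cnt = (unsigned)((v >> 16) & 0xFF);
    c.ser_p   = (int)((v >> 24) & 1);
    c.emc_p   = (int)((v >> 25) & 1);
    c.elog_p  = (int)((v >> 26) & 1);
    c.lmce_p  = (int)((v >> 27) & 1);
    return c;
}
```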
Figure: IA32_MCG_STATUS register — RIPV (bit 0), EIPV (bit 1), MCIP (bit 2), LMCE_S (bit 3); remaining bits reserved.
Where:
• RIPV (restart IP valid) flag, bit 0 — Indicates (when set) that program execution can be restarted reliably
at the instruction pointed to by the instruction pointer pushed on the stack when the machine-check exception
is generated. When clear, the program cannot be reliably restarted at the pushed instruction pointer.
• EIPV (error IP valid) flag, bit 1 — Indicates (when set) that the instruction pointed to by the instruction
pointer pushed onto the stack when the machine-check exception is generated is directly associated with the
error. When this flag is cleared, the instruction pointed to may not be associated with the error.
• MCIP (machine check in progress) flag, bit 2 — Indicates (when set) that a machine-check exception was
generated. Software can set or clear this flag. The occurrence of a second Machine-Check Event while MCIP is
set will cause the processor to enter a shutdown state. For information on processor behavior in the shutdown
state, please refer to the description in Chapter 6, “Interrupt and Exception Handling”: “Interrupt 8—Double
Fault Exception (#DF)”.
• LMCE_S (local machine check exception signaled), bit 3 — Indicates (when set) that a local machine-
check exception was generated. This indicates that the current machine-check event was delivered to only this
logical processor.
Bits 63:4 in IA32_MCG_STATUS are reserved. An attempt to write to IA32_MCG_STATUS with any value other
than 0 results in a general-protection (#GP) exception.
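The flags above can be extracted with simple accessors (macro and helper names are mine, for illustration only); in particular, a machine-check handler may resume the interrupted program only when RIPV is set:

```c
#include <stdint.h>

/* IA32_MCG_STATUS flag accessors. */
#define MCG_STATUS_RIPV(v)   (((v) >> 0) & 1ULL)
#define MCG_STATUS_EIPV(v)   (((v) >> 1) & 1ULL)
#define MCG_STATUS_MCIP(v)   (((v) >> 2) & 1ULL)
#define MCG_STATUS_LMCE_S(v) (((v) >> 3) & 1ULL)

/* Execution can be restarted reliably only when RIPV is set. */
static int mc_restartable(uint64_t mcg_status)
{
    return (int)MCG_STATUS_RIPV(mcg_status);
}
```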
Figure: IA32_MCG_EXT_CTL register — LMCE_EN (bit 0); bits 63:1 reserved.
Where:
• LMCE_EN (local machine check exception enable) flag, bit 0 — System software sets this flag to allow
hardware to signal some MCEs to only a single logical processor. System software can set LMCE_EN only if the
platform software has configured IA32_FEATURE_CONTROL as described in Section 15.3.1.5.
Figure: IA32_MCi_CTL register — error-enable flags EE00 through EE63, one per bit (bit 0 through bit 63).
NOTE
For P6 family processors and processors based on Intel Core microarchitecture (excluding those on
which CPUID reports DisplayFamily_DisplayModel as 06H_1AH and onward): the operating system
or executive software must not modify the contents of the IA32_MC0_CTL MSR. This MSR is
internally aliased to the EBL_CR_POWERON MSR and controls platform-specific error handling
features. System-specific firmware (the BIOS) is responsible for the appropriate initialization of the
IA32_MC0_CTL MSR. P6 family processors only allow the writing of all 1s or all 0s to the
IA32_MCi_CTL MSR.
NOTE
Figure 15-6 depicts the IA32_MCi_STATUS MSR when IA32_MCG_CAP[24] = 1,
IA32_MCG_CAP[11] = 1, and IA32_MCG_CAP[10] = 1. When IA32_MCG_CAP[24] = 0 and
IA32_MCG_CAP[11] = 1, bits 56:55 are reserved and bits 54:53 are used for threshold-based error
reporting. When IA32_MCG_CAP[11] = 0, bits 56:53 are part of the “Other Information” field. The
use of bits 54:53 for threshold-based error reporting began with Intel Core Duo processors, and is
currently used for cache memory. See Section 15.4, “Enhanced Cache Error reporting,” for more
information. When IA32_MCG_CAP[10] = 0, bits 52:38 are part of the “Other Information” field.
The use of bits 52:38 for the corrected MC error count was introduced with Intel 64 processors on
which CPUID reports DisplayFamily_DisplayModel as 06H_1AH.
Where:
• MCA (machine-check architecture) error code field, bits 15:0 — Specifies the machine-check archi-
tecture-defined error code for the machine-check error condition detected. The machine-check architecture-
defined error codes are guaranteed to be the same for all IA-32 processors that implement the machine-check
architecture. See Section 15.9, “Interpreting the MCA Error Codes,” and Chapter 16, “Interpreting Machine-
Check Error Codes”, for information on machine-check error codes.
• Model-specific error code field, bits 31:16 — Specifies the model-specific error code that uniquely
identifies the machine-check error condition detected. The model-specific error codes may differ among IA-32
processors for the same machine-check error condition. See Chapter 16, “Interpreting Machine-Check Error
Codes,” for information on model-specific error codes.
• Reserved, Error Status, and Other Information fields, bits 56:32 —
• If IA32_MCG_CAP.MCG_EMC_P[bit 25] is 0, bits 37:32 contain “Other Information” that is
implementation-specific and is not part of the machine-check architecture.
• If IA32_MCG_CAP.MCG_EMC_P is 1, “Other Information” is in bits 36:32. If bit 37 is 0, system firmware
has not changed the contents of IA32_MCi_STATUS. If bit 37 is 1, system firmware may have edited the
contents of IA32_MCi_STATUS.
• If IA32_MCG_CAP.MCG_CMCI_P[bit 10] is 0, bits 52:38 also contain “Other Information” (in the same
sense as bits 37:32).
Figure 15-6. IA32_MCi_STATUS register — MCA Error Code (bits 15:0), Model-Specific Error Code (bits 31:16), Other Info (bits 36:32), firmware-updated indicator (bit 37), Corrected Error Count (bits 52:38), threshold-based error status (bits 54:53), AR (bit 55), S (bit 56), PCC (bit 57), ADDRV (bit 58), MISCV (bit 59), EN (bit 60), UC (bit 61), OVER (bit 62), VAL (bit 63).
• If IA32_MCG_CAP[10] is 1, bits 52:38 are architectural (not model-specific). In this case, bits 52:38
report the value of a 15-bit counter that increments each time a corrected error is observed by the MCA
recording bank. This count value continues to increment until cleared by software. The most
significant bit, 52, is a sticky count overflow bit.
• If IA32_MCG_CAP[11] is 0, bits 56:53 also contain “Other Information” (in the same sense).
• If IA32_MCG_CAP[11] is 1, bits 56:53 are architectural (not model-specific). In this case, bits 56:53
have the following functionality:
• If IA32_MCG_CAP[24] is 0, bits 56:55 are reserved.
• If IA32_MCG_CAP[24] is 1, bits 56:55 are defined as follows:
• S (Signaling) flag, bit 56 - Signals the reporting of UCR errors in this MC bank. See Section 15.6.2
for additional detail.
• AR (Action Required) flag, bit 55 - Indicates (when set) that MCA error code specific recovery
action must be performed by system software at the time this error was signaled. See Section
15.6.2 for additional detail.
• If the UC bit (Figure 15-6) is 1, bits 54:53 are undefined.
• If the UC bit (Figure 15-6) is 0, bits 54:53 indicate the status of the hardware structure that
reported the threshold-based error. See Table 15-1.
• PCC (processor context corrupt) flag, bit 57 — Indicates (when set) that the state of the processor might
have been corrupted by the error condition detected and that reliable restarting of the processor may not be
possible. When clear, this flag indicates that the error did not affect the processor’s state, and software may be
able to restart. When system software supports recovery, consult Section 15.10.4, “Machine-Check Software
Handler Guidelines for Error Recovery” for additional rules that apply.
• ADDRV (IA32_MCi_ADDR register valid) flag, bit 58 — Indicates (when set) that the IA32_MCi_ADDR
register contains the address where the error occurred (see Section 15.3.2.3, “IA32_MCi_ADDR MSRs”). When
clear, this flag indicates that the IA32_MCi_ADDR register is either not implemented or does not contain the
address where the error occurred. Do not read these registers if they are not implemented in the processor.
• MISCV (IA32_MCi_MISC register valid) flag, bit 59 — Indicates (when set) that the IA32_MCi_MISC
register contains additional information regarding the error. When clear, this flag indicates that the
IA32_MCi_MISC register is either not implemented or does not contain additional information regarding the
error. Do not read these registers if they are not implemented in the processor.
• EN (error enabled) flag, bit 60 — Indicates (when set) that the error was enabled by the associated EEj bit
of the IA32_MCi_CTL register.
• UC (error uncorrected) flag, bit 61 — Indicates (when set) that the processor did not or was not able to
correct the error condition. When clear, this flag indicates that the processor was able to correct the error
condition.
• OVER (machine check overflow) flag, bit 62 — Indicates (when set) that a machine-check error occurred
while the results of a previous error were still in the error-reporting register bank (that is, the VAL bit was
already set in the IA32_MCi_STATUS register). The processor sets the OVER flag and software is responsible for
clearing it. In general, enabled errors are written over disabled errors, and uncorrected errors are written over
corrected errors. Uncorrected errors are not written over previous valid uncorrected errors. When
MCG_CMCI_P is set, corrected errors may not set the OVER flag. Software can rely on corrected error count in
IA32_MCi_Status[52:38] to determine if any additional corrected errors may have occurred. For more infor-
mation, see Section 15.3.2.2.1, “Overwrite Rules for Machine Check Overflow”.
• VAL (IA32_MCi_STATUS register valid) flag, bit 63 — Indicates (when set) that the information within the
IA32_MCi_STATUS register is valid. When this flag is set, the processor follows the rules given for the OVER flag
in the IA32_MCi_STATUS register when overwriting previously valid entries. The processor sets the VAL flag
and software is responsible for clearing it.
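The architectural IA32_MCi_STATUS fields above can be pulled out of a raw value with accessors such as the following (macro and helper names are mine; the corrected error count and its sticky overflow bit are meaningful only when IA32_MCG_CAP[10] = 1):

```c
#include <stdint.h>

/* IA32_MCi_STATUS flag accessors. */
#define MCI_STATUS_VAL(v)    (((v) >> 63) & 1ULL)
#define MCI_STATUS_OVER(v)   (((v) >> 62) & 1ULL)
#define MCI_STATUS_UC(v)     (((v) >> 61) & 1ULL)
#define MCI_STATUS_EN(v)     (((v) >> 60) & 1ULL)
#define MCI_STATUS_MISCV(v)  (((v) >> 59) & 1ULL)
#define MCI_STATUS_ADDRV(v)  (((v) >> 58) & 1ULL)
#define MCI_STATUS_PCC(v)    (((v) >> 57) & 1ULL)

/* Corrected error count proper (bits 51:38 of the 15-bit field). */
static unsigned mci_corrected_count(uint64_t v)
{
    return (unsigned)((v >> 38) & 0x3FFF);
}

/* Sticky count-overflow bit (bit 52). */
static int mci_count_overflow(uint64_t v)
{
    return (int)((v >> 52) & 1);
}

/* Architecture-defined MCA error code (bits 15:0). */
static unsigned mci_mca_error_code(uint64_t v)
{
    return (unsigned)(v & 0xFFFF);
}
```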
If a second event overwrites a previously posted event, the information (as guarded by individual valid bits) in the
MCi bank is entirely from the second event. Similarly, if a first event is retained, all of the information previously
posted for that event is retained. In general, when the logged error or the recent error is a corrected error, the
OVER bit (MCi_Status[62]) may be set to indicate an overflow. When MCG_CMCI_P is set in IA32_MCG_CAP,
system software should consult IA32_MCi_STATUS[52:38] to determine if additional corrected errors may have
occurred. Software may re-read IA32_MCi_STATUS, IA32_MCi_ADDR, and IA32_MCi_MISC as appropriate to
ensure that the data collected represents the last error logged.
After software polls a posting and clears the register, the valid bit is no longer set, and therefore the meaning of the
rest of the bits, including the yellow/green/00 status field in bits 54:53, is undefined. The yellow/green indication
will only be posted for events associated with monitored structures; otherwise, the unmonitored (00) code will be
posted in IA32_MCi_STATUS[54:53].
Figure: IA32_MCi_ADDR register — Address field (remaining bits reserved). The valid bits in the Address field depend on the address methodology in use when the register state is saved.
Figure: IA32_MCi_MISC register — Recoverable Address LSB (bits 5:0), Address Mode (bits 8:6); bits 63:9 are model-specific.
• Recoverable Address LSB (bits 5:0): The lowest valid recoverable address bit. Indicates the position of the least
significant bit (LSB) of the recoverable error address. For example, if the processor logs bits [43:9] of the
address, the LSB sub-field in IA32_MCi_MISC is 01001b (9 decimal). For this example, bits [8:0] of the
recoverable error address in IA32_MCi_ADDR should be ignored.
• Address Mode (bits 8:6): Address mode for the address logged in IA32_MCi_ADDR. The supported address
modes are given in Table 15-3.
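Applying the Recoverable Address LSB field can be sketched as follows (helper names are mine): mask off the invalid low bits of IA32_MCi_ADDR. With LSB = 9 (01001b), bits [8:0] of the logged address are discarded, matching the example above.

```c
#include <stdint.h>

/* Recoverable Address LSB field of IA32_MCi_MISC (bits 5:0). */
static unsigned misc_addr_lsb(uint64_t mci_misc)
{
    return (unsigned)(mci_misc & 0x3F);
}

/* Address Mode field of IA32_MCi_MISC (bits 8:6). */
static unsigned misc_addr_mode(uint64_t mci_misc)
{
    return (unsigned)((mci_misc >> 6) & 0x7);
}

/* Zero the bits below the least significant valid address bit. */
static uint64_t recoverable_address(uint64_t mci_addr, unsigned lsb)
{
    return mci_addr & ~((1ULL << lsb) - 1);
}
```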
15.3.2.4.2 IOMCA
Logging and signaling of errors from the PCI Express domain are governed by the PCI Express Advanced Error
Reporting (AER) architecture. The PCI Express architecture divides errors into two categories: uncorrectable errors
and correctable errors. Uncorrectable errors can further be classified as fatal or non-fatal. Uncorrected I/O errors
are signaled to system software either as an AER Message Signaled Interrupt (MSI) or via platform-specific
mechanisms such as NMI. Generally, the signaling mechanism is controlled by BIOS and/or platform firmware.
Certain processors support an error handling mode, called IOMCA mode, in which uncorrected PCI Express errors
are signaled in the form of a machine-check exception and logged in machine check banks.
When a processor is in this mode, uncorrected PCI Express errors are logged in the MCACOD field of the
IA32_MCi_STATUS register as a generic I/O error. The corresponding MCA error code is defined in Table 15-8,
“IA32_MCi_Status [15:0] Simple Error Code Encoding.” Machine check logging complements, and does not
replace, the AER logging that occurs inside the PCI Express hierarchy. The PCI Express Root Complex and
Endpoints continue to log the error in accordance with the PCI Express AER mechanism. In IOMCA mode, the
MCi_MISC register in the bank that logged the IOMCA can optionally contain information that links the machine
check logs with the AER logs or proprietary logs. In such a scenario, the machine check handler can use the
contents of MCi_MISC to locate the next level of error logs corresponding to the same error. Specifically, if
MCi_Status.MISCV is 1 and MCACOD is 0x0E0B, MCi_MISC contains the PCI Express address of the Root Complex
device containing the AER logs. Software can consult the header type and class code registers in the Root Complex
device's PCIe configuration space to determine what type of device it is. This Root Complex device can be a PCI
Express Root Port, a PCI Express Root Complex Event Collector, or a proprietary device.
Errors that originate from PCI Express or Legacy Endpoints are logged in the corresponding Root Port in addition to
the generating device. If MISCV=1 and MCi_MISC contains the address of the Root Port or a Root Complex Event
Collector, software can parse the AER logs to learn more about the error.
If MISCV=1 and MCi_MISC points to a device that is neither a Root Complex Event Collector nor a Root Port,
software must consult the Vendor ID/Device ID and use device-specific knowledge to locate and interpret the error
log registers. In some cases, the Root Complex device configuration space may not be accessible to software, and
both the Vendor and Device ID read as 0xFFFF.
The format of MCi_MISC for IOMCA errors is shown in Table 15-4.
NOTES:
1. Not Applicable if ADDRV=0.
Refer to the PCI Express Specification 3.0 for the definition of the PCI Express Requestor ID and the AER
architecture. Refer to the PCI Firmware Specification 3.0 for an explanation of the PCI Express Segment number
and how software can access the configuration space of a PCI Express device given the segment number and
Requestor ID.
Figure: IA32_MCi_CTL2 register — Corrected error count threshold (bits 14:0), CMCI_EN (bit 30); all other bits reserved.
• Corrected error count threshold, bits 14:0 — Software must initialize this field. The value is compared with
the corrected error count field in IA32_MCi_STATUS, bits 38 through 52. An overflow event is signaled to the
CMCI LVT entry (see Table 10-1) in the APIC when the count value equals the threshold value. The new LVT
entry in the APIC is at offset 02F0H from the APIC_BASE. If the CMCI interface is not supported for a particular
bank (but IA32_MCG_CAP[10] = 1), this field always reads 0.
• CMCI_EN (corrected error interrupt enable/disable/indicator), bit 30 — Software sets this bit to
enable the generation of a corrected machine-check error interrupt (CMCI). If the CMCI interface is not supported
for a particular bank (but IA32_MCG_CAP[10] = 1), this bit is writable but always returns 0 for that bank. This
bit also indicates whether CMCI is supported in the corresponding bank. See Section 15.5 for details on
software detection of the CMCI facility.
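Composing an IA32_MCi_CTL2 value from the two architectural fields above can be sketched as follows (macro and helper names are mine, for illustration):

```c
#include <stdint.h>

#define MCI_CTL2_CMCI_EN (1ULL << 30)  /* bit 30 */

/* Build an IA32_MCi_CTL2 value from a threshold and the enable flag. */
static uint64_t mci_ctl2_value(unsigned threshold, int cmci_en)
{
    uint64_t v = (uint64_t)(threshold & 0x7FFF);  /* bits 14:0 */
    if (cmci_en)
        v |= MCI_CTL2_CMCI_EN;
    return v;
}
```

For example, a threshold of 1 with CMCI enabled yields 40000001H.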
Some microarchitectural sub-systems that are the source of corrected MC errors may be shared by more than one
logical processor. Consequently, the facilities for reporting MC errors and the controlling mechanisms may be
shared by more than one logical processor. For example, the IA32_MCi_CTL2 MSR is shared between logical
processors sharing a processor core. Software is responsible for programming the IA32_MCi_CTL2 MSR in a
manner consistent with CMCI delivery and usage.
After a processor reset, the IA32_MCi_CTL2 MSRs are zeroed.
In processors with support for Intel 64 architecture, 64-bit machine check state MSRs are aliased to the legacy
MSRs. In addition, there may be registers beyond IA32_MCG_MISC. These may include up to five reserved MSRs
(IA32_MCG_RESERVED[1:5]) and save-state MSRs for registers introduced in 64-bit mode. See Table 15-6.
When a machine-check error is detected on a Pentium 4 or Intel Xeon processor, the processor saves the state of
the general-purpose registers, the R/EFLAGS register, and the R/EIP in these extended machine-check state MSRs.
This information can be used by a debugger to analyze the error.
These registers are read/write-to-zero registers: software can read them, but if software writes to them, only all
zeros are allowed. If software attempts to write a non-zero value into one of these registers, a general-protection
(#GP) exception is generated. These registers are cleared on a hardware reset (power-up or RESET), but maintain
their contents following a soft reset (INIT reset).
A processor that supports enhanced cache error reporting contains hardware that tracks the operating status of
certain caches and provides an indicator of their “health”. The hardware reports a “green” status when the number
of lines that incur repeated corrections is at or below a pre-defined threshold, and a “yellow” status when the
number of affected lines exceeds the threshold. Yellow status means that the cache reporting the event is operating
correctly, but you should schedule the system for servicing within a few weeks.
Intel recommends that you rely on this mechanism for structures supported by threshold-based error reporting.
The CPU/system/platform response to a yellow event should be less severe than its response to an uncorrected
error. An uncorrected error means that a serious error has actually occurred, whereas the yellow condition is a
warning that the number of affected lines has exceeded the threshold but is not, in itself, a serious event: the error
was corrected and system state was not compromised.
The green/yellow status indicator is not a foolproof early warning for an uncorrected error resulting from the failure
of two bits in the same ECC block. Such a failure can occur and cause an uncorrected error before the yellow
threshold is reached. However, the chance of an uncorrected error increases as the number of affected lines
increases.
CMCI interrupt delivery is configured by writing to the LVT CMCI register entry in the local APIC register space, at
the default address of APIC_BASE + 2F0H. A CMCI interrupt can be delivered to more than one logical processor if
multiple logical processors are affected by the associated MC errors. For example, if a corrected bit error in a cache
shared by two logical processors causes a CMCI, the interrupt is delivered to both logical processors sharing
that microarchitectural sub-system. Similarly, package-level errors may cause CMCI to be delivered to all logical
processors within the package. However, system-level errors will not be handled by CMCI.
See Section 10.5.1, “Local Vector Table” for details regarding the LVT CMCI register.
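The LVT CMCI entry follows the common local-vector-table layout described in Section 10.5.1 (vector in bits 7:0, delivery mode in bits 10:8, mask in bit 16). As a sketch of how system software might compose the value to program, assuming fixed delivery mode (the helper name below is illustrative, not from this manual):

```c
#include <assert.h>
#include <stdint.h>

#define LVT_DELIVERY_FIXED (0u << 8)   /* 000B: fixed delivery mode */
#define LVT_MASKED         (1u << 16)  /* bit 16: mask the interrupt */

/* Compose a value for the LVT CMCI register (local APIC offset 2F0H).
   Writing it to the APIC (an MMIO store, or an MSR write in x2APIC
   mode) is platform code and is not shown here. */
static uint32_t lvt_cmci_value(uint8_t vector, int masked)
{
    uint32_t v = (uint32_t)vector | LVT_DELIVERY_FIXED;
    if (masked)
        v |= LVT_MASKED;
    return v;
}
```

System software typically programs the entry masked during initialization and unmasks it only after the CMCI handler is installed.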
15.5.2 System Software Recommendation for Managing CMCI and Machine Check Resources
System software must enable and manage CMCI, set up interrupt handlers to service CMCI interrupts delivered to
affected logical processors, program the CMCI LVT entry, and query machine-check banks that are shared by more
than one logical processor.
This section describes techniques system software can implement to manage CMCI initialization and to service
CMCI interrupts efficiently, minimizing contention for shared MSR resources.
— Log and clear all IA32_MCi_STATUS registers for the banks that this thread owns. This allows new
errors to be logged.
action for a given SRAO error. If MISCV and ADDRV are not set, it is recommended that no system software
error recovery be performed; however, system software can resume execution.
• Software recoverable action required (SRAR) - a UCR error that requires system software to take a recovery
action on this processor before scheduling another stream of execution on this processor. SRAR errors indicate
that the error was detected and raised at the point of consumption in the execution flow. An SRAR error is
indicated with UC=1, PCC=0, S=1, EN=1 and AR=1 in the IA32_MCi_STATUS register. Recovery actions are
MCA error code specific. The MISCV and the ADDRV flags in the IA32_MCi_STATUS register are set when the
additional error information is available from the IA32_MCi_MISC and the IA32_MCi_ADDR registers. System
software needs to inspect the MCA error code fields in the IA32_MCi_STATUS register to identify the specific
recovery action for a given SRAR error. If MISCV and ADDRV are not set, it is recommended that system
software shut down the system.
Table 15-7 summarizes UCR, corrected, and uncorrected errors.
Table 15-7. MC Error Classifications
Type of Error1           UC  EN  PCC  S   AR  Signaling  Software Action                        Example
Uncorrected Error (UC)   1   1   1    x   x   MCE        If EN=1, reset the system; else log
                                                         and OK to keep the system running.
SRAR                     1   1   0    1   1   MCE        For known MCACOD, take the specific    Cache to processor
                                                         recovery action; for unknown           load error.
                                                         MCACOD, must bugcheck.
                                                         If OVER=1, reset the system; else
                                                         take the specific recovery action.
SRAO                     1   x2  0    x2  0   MCE/CMC    For known MCACOD, take the specific    Patrol scrub and
                                                         recovery action; for unknown           explicit writeback
                                                         MCACOD, OK to keep the system          poison errors.
                                                         running.
UCNA                     1   x   0    0   0   CMC        Log the error and OK to keep the       Poison detection
                                                         system running.                        error.
Corrected Error (CE)     0   x   x    x   x   CMC        Log the error; no corrective action    ECC in caches and
                                                         required.                              memory.
NOTES:
1. SRAR, SRAO and UCNA errors are supported by the processor only when IA32_MCG_CAP[24] (MCG_SER_P) is set.
2. EN=1, S=1 when signaled via MCE. EN=x, S=0 when signaled via CMC.
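The classification in Table 15-7 can be expressed as a short decision procedure over the architectural IA32_MCi_STATUS flag bits (VAL = bit 63, UC = bit 61, PCC = bit 57, S = bit 56, AR = bit 55). A minimal sketch, assuming IA32_MCG_CAP[24] (MCG_SER_P) is set so that the UCR classes are defined:

```c
#include <assert.h>
#include <stdint.h>

/* Architectural flag bits of IA32_MCi_STATUS. */
#define MCI_STATUS_VAL (1ULL << 63)
#define MCI_STATUS_UC  (1ULL << 61)
#define MCI_STATUS_EN  (1ULL << 60)
#define MCI_STATUS_PCC (1ULL << 57)
#define MCI_STATUS_S   (1ULL << 56)
#define MCI_STATUS_AR  (1ULL << 55)

enum mc_error_class { MC_INVALID, MC_CE, MC_UCNA, MC_SRAO, MC_SRAR, MC_UC };

/* Classify a logged error following Table 15-7. */
static enum mc_error_class classify(uint64_t status)
{
    if (!(status & MCI_STATUS_VAL))
        return MC_INVALID;
    if (!(status & MCI_STATUS_UC))
        return MC_CE;                      /* corrected error */
    if (status & MCI_STATUS_PCC)
        return MC_UC;                      /* uncorrected, context corrupt */
    if (!(status & MCI_STATUS_S))
        return MC_UCNA;                    /* signaled via CMC, no action */
    return (status & MCI_STATUS_AR) ? MC_SRAR : MC_SRAO;
}
```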
Table 15-8 lists overwrite rules for uncorrected errors, corrected errors, and uncorrected recoverable errors.
Table 15-8. Overwrite Rules for UC, CE, and UCR Errors
First Event Second Event UC PCC S AR MCA Bank Reset System
CE UCR 1 0 0 if UCNA, else 1 1 if SRAR, else 0 second yes, if AR=1
UCR CE 1 0 0 if UCNA, else 1 1 if SRAR, else 0 first yes, if AR=1
UCNA UCNA 1 0 0 0 first no
UCNA SRAO 1 0 1 0 first no
UCNA SRAR 1 0 1 1 first yes
SRAO UCNA 1 0 1 0 first no
SRAO SRAO 1 0 1 0 first no
SRAO SRAR 1 0 1 1 first yes
SRAR UCNA 1 0 1 1 first yes
SRAR SRAO 1 0 1 1 first yes
SRAR SRAR 1 0 1 1 first yes
UCR UC 1 1 undefined undefined second yes
UC UCR 1 1 undefined undefined first yes
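The retention rules in Table 15-8 (which event's log data the MCA bank holds) can be modeled in software as follows; the enum labels are illustrative groupings, not architectural encodings:

```c
#include <assert.h>

/* Event classes for the Table 15-8 overwrite model. UC here means a
   plain uncorrected error (PCC=1); UCNA/SRAO/SRAR are the UCR classes. */
enum ev { CE, UCNA, SRAO, SRAR, UC };

/* Return which event's information the bank retains:
   1 = first event, 2 = second event. */
static int retained(enum ev first, enum ev second)
{
    if (first == UC)  return 1;  /* a logged UC error keeps the bank */
    if (second == UC) return 2;  /* a UC error overwrites a prior CE/UCR */
    if (first == CE)  return 2;  /* a UCR error overwrites a corrected error */
    if (second == CE) return 1;  /* a corrected error never overwrites UCR */
    return 1;                    /* between two UCR errors, the first wins */
}
```

Note that the "Reset System" column is orthogonal to retention: per the table, a reset is required whenever the surviving error is SRAR (AR=1) or plain UC.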
THEN
IA32_MCG_EXT_CTL ← IA32_MCG_EXT_CTL | 01H;
(* System software enables LMCE capability for hardware to signal MCE to a single logical processor*)
FI
ELSE
(* Enable logging of all errors including MC0_CTL register *)
FOR error-reporting banks (0 through MAX_BANK_NUMBER)
DO
IA32_MCi_CTL ← 0FFFFFFFFFFFFFFFFH;
OD
FI
FI
FI
Set up the machine-check exception (#MC) handler for vector 18 in the IDT
Set the MCE bit (bit 6) in the CR4 register to enable machine-check exceptions
FI
The “Interpretation” column in the table indicates the name of a compound error. The name is constructed by
substituting mnemonics for the sub-field names given within curly braces. For example, the error code
ICACHEL1_RD_ERR is constructed from the form:
{TT}CACHE{LL}_{RRRR}_ERR,
where {TT} is replaced by I, {LL} is replaced by L1, and {RRRR} is replaced by RD.
For more information on the “Form” and “Interpretation” columns, see Section 15.9.2.1, “Correction
Report Filtering (F) Bit,” through Section 15.9.2.5, “Bus and Interconnect Errors.”
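As an illustration of this construction, the mnemonic can be rebuilt mechanically from the sub-fields of a cache hierarchy error code (form 0000 0001 RRRR TTLL). The helper below is a sketch using the sub-field encodings from Section 15.9.2 (TT: 00=I, 01=D, 10=G; LL: 00=L0, 01=L1, 10=L2, 11=LG; RRRR: 0000=ERR through 1000=SNOOP):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Rebuild the compound error name {TT}CACHE{LL}_{RRRR}_ERR from a
   cache hierarchy error code of the form 0000 0001 RRRR TTLL.
   The correction report filtering (F) bit, bit 12, is ignored.
   Returns NULL if the code is not a cache hierarchy error. */
static const char *cache_err_name(uint16_t mcacod)
{
    static const char *tt[]   = { "I", "D", "G", "?" };
    static const char *ll[]   = { "L0", "L1", "L2", "LG" };
    static const char *rrrr[] = { "ERR", "RD", "WR", "DRD", "DWR",
                                  "IRD", "PREFETCH", "EVICT", "SNOOP" };
    static char out[32];
    unsigned code = mcacod & ~(1u << 12);      /* mask the F bit */
    unsigned r = (code >> 4) & 0xF;            /* RRRR request subfield */
    if ((code & 0xFF00) != 0x0100 || r > 8)
        return NULL;
    snprintf(out, sizeof out, "%sCACHE%s_%s_ERR",
             tt[(code >> 2) & 3], ll[code & 3], rrrr[r]);
    return out;
}
```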
Note that the CCCC channel number may be enumerated from zero separately by each memory controller on a
system. On a multi-socket system, or a system with multiple memory controllers per socket, it is necessary to also
consider which machine check bank logged the error. See Chapter 16 for details on specific implementations.
Table 15-16. MCA Compound Error Code Encoding for SRAO Errors
Type MCACOD Value MCA Error Code Encoding1
Memory Scrubbing C0H - CFH 0000_0000_1100_CCCC
000F 0000 1MMM CCCC (Memory Controller Error), where
Memory subfield MMM = 100B (memory scrubbing)
Channel subfield CCCC = channel # or generic
L3 Explicit Writeback 17AH 0000_0001_0111_1010
000F 0001 RRRR TTLL (Cache Hierarchy Error) where
Request subfields RRRR = 0111B (Eviction)
Transaction Type subfields TT = 10B (Generic)
Level subfields LL = 10B
NOTES:
1. Note that for both of these errors the correction report filtering (F) bit (bit 12) of the MCA error must be ignored.
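The SRAO encodings above can be matched in software by masking the F bit and comparing against the fixed pattern; for the memory-scrubbing form this also yields the channel sub-field. A minimal sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Memory-scrubbing SRAO codes have the form 0000_0000_1100_CCCC
   (C0H-CFH). Bit 12, the correction report filtering (F) bit, must be
   ignored when matching. Returns the CCCC channel sub-field (1111B
   means "generic"), or -1 if the code is not a scrubbing error. */
static int scrub_channel(uint16_t mcacod)
{
    uint16_t code = mcacod & ~(1u << 12);  /* mask the F bit */
    if ((code & 0xFFF0) != 0x00C0)
        return -1;
    return code & 0xF;
}
```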
Table 15-17 lists values of relevant bit fields of IA32_MCi_STATUS for architecturally defined SRAO errors.
Table 15-17. IA32_MCi_STATUS Values for SRAO Errors
SRAO Error Valid OVER UC EN MISCV ADDRV PCC S AR MCACOD
Memory Scrubbing       1 0 1 x1 1 1 0 x1 0 C0H-CFH
L3 Explicit Writeback  1 0 1 x1 1 1 0 x1 0 17AH
NOTES:
1. When signaled as MCE, EN=1 and S=1. If error was signaled via CMC, then EN=x, and S=0.
For both the memory scrubbing and L3 explicit writeback errors, the ADDRV and MISCV flags in the
IA32_MCi_STATUS register are set to indicate that the offending physical address information is available from the
IA32_MCi_MISC and the IA32_MCi_ADDR registers. For the memory scrubbing and L3 explicit writeback errors,
the address mode in the IA32_MCi_MISC register should be set to physical address mode (010b) and the address
LSB information in the IA32_MCi_MISC register should indicate the lowest valid address bit in the address
information provided from the IA32_MCi_ADDR register.
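The address mode and LSB fields referred to here occupy bits 8:6 and bits 5:0 of IA32_MCi_MISC, respectively. A handler can recover the usable physical address as sketched below, clearing the address bits below the reported LSB:

```c
#include <assert.h>
#include <stdint.h>

#define MCI_MISC_ADDR_LSB(m)  ((unsigned)((m) & 0x3F))       /* bits 5:0 */
#define MCI_MISC_ADDR_MODE(m) ((unsigned)(((m) >> 6) & 0x7)) /* bits 8:6 */
#define ADDR_MODE_PHYS        2u                             /* 010B */

/* 1 if IA32_MCi_MISC reports the address in physical address mode. */
static int misc_is_phys(uint64_t misc)
{
    return MCI_MISC_ADDR_MODE(misc) == ADDR_MODE_PHYS;
}

/* Usable recoverable address: bits of IA32_MCi_ADDR below the
   recoverable-address LSB are not valid and are cleared. */
static uint64_t recoverable_addr(uint64_t addr, uint64_t misc)
{
    return addr & ~((1ULL << MCI_MISC_ADDR_LSB(misc)) - 1);
}
```

For a cache-line-granular error, for example, the LSB field would report 6, so the low six address bits are discarded.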
The MCE signal is broadcast to all logical processors as outlined in Section 15.10.4.1. If LMCE is supported and enabled,
some errors (not limited to UCR errors) may be delivered to only a single logical processor. System software should
consult IA32_MCG_STATUS.LMCE_S to determine whether the MCE was signaled only to this logical processor.
IA32_MCi_STATUS banks can be shared by logical processors within a core or within the same package. As a result,
several logical processors may find an SRAO error in a shared IA32_MCi_STATUS bank while other processors do not
find it in any of their IA32_MCi_STATUS banks. Table 15-18 shows the RIPV and EIPV flag indication in the
IA32_MCG_STATUS register for the memory scrubbing and L3 explicit writeback errors on both the reporting and
non-reporting logical processors.
Table 15-19. MCA Compound Error Code Encoding for SRAR Errors
Type MCACOD Value MCA Error Code Encoding1
Data Load 134H 0000_0001_0011_0100
000F 0001 RRRR TTLL (Cache Hierarchy Error), where
Request subfield RRRR = 0011B (Data Load)
Transaction Type subfield TT= 01B (Data)
Level subfield LL = 00B (Level 0)
Instruction Fetch 150H 0000_0001_0101_0000
000F 0001 RRRR TTLL (Cache Hierarchy Error), where
Request subfield RRRR = 0101B (Instruction Fetch)
Transaction Type subfield TT= 00B (Instruction)
Level subfield LL = 00B (Level 0)
NOTES:
1. Note that for both of these errors the correction report filtering (F) bit (bit 12) of the MCA error must be ignored.
Table 15-20 lists values of relevant bit fields of IA32_MCi_STATUS for architecturally defined SRAR errors.
Table 15-20. IA32_MCi_STATUS Values for SRAR Errors
SRAR Error Valid OVER UC EN MISCV ADDRV PCC S AR MCACOD
Data Load 1 0 1 1 1 1 0 1 1 134H
Instruction Fetch 1 0 1 1 1 1 0 1 1 150H
For both the data load and instruction fetch errors, the ADDRV and MISCV flags in the IA32_MCi_STATUS register
are set to indicate that the offending physical address information is available from the IA32_MCi_MISC and the
IA32_MCi_ADDR registers. For the data load and instruction fetch errors, the address mode in the
IA32_MCi_MISC register should be set to physical address mode (010b) and the address LSB information in the
IA32_MCi_MISC register should indicate the lowest valid address bit in the address information provided from the
IA32_MCi_ADDR register.
The MCE signal is broadcast to all logical processors on systems on which UCR errors are supported, except when
the processor supports LMCE and LMCE has been enabled by system software (see Section 15.3.1.5). The
IA32_MCG_STATUS MSR allows system software to distinguish the logical processor affected by an SRAR error
from the other logical processors that observed the SRAR error via an IA32_MCi_STATUS bank.
Table 15-21 shows the RIPV and EIPV flag indication in the IA32_MCG_STATUS register for the data load and
instruction fetch errors on both the reporting and non-reporting logical processors. The recoverable SRAR error
reported by a processor may be continuable, where system software can interpret “continuable” as follows: the
error was isolated and contained; if software can rectify the error condition in the current instruction
stream, the execution context on that logical processor can be continued without loss of information.
NOTES:
1. See the definition of “continuable” above and additional detail below.
does not support recursion. When the processor detects machine-check recursion, it enters the shutdown
state.
Example 15-2 gives typical steps carried out by a machine-check exception handler.
NOTE
On processors that support MCA (CPUID.1.EDX.MCA = 1) reading the P5_MC_TYPE and
P5_MC_ADDR registers may produce invalid data.
When machine-check exceptions are enabled for the Pentium processor (MCE flag is set in control register CR4),
the machine-check exception handler uses the RDMSR instruction to read the error type from the P5_MC_TYPE
register and the machine check address from the P5_MC_ADDR register. The handler then normally reports these
register values to the system console before aborting execution (see Example 15-2).
can use external bus related model-specific information provided with the error report to localize the source of the
error within the system and determine the appropriate recovery strategy.
• When the EN flag is zero but the VAL and UC flags are one in the IA32_MCi_STATUS register, the reported
uncorrected error in this bank is not enabled. As uncorrected errors with the EN flag = 0 are not the source of
machine check exceptions, the MCE handler should log and clear non-enabled errors when the S bit is set and
should continue searching for enabled errors in the other IA32_MCi_STATUS registers. Note that when
IA32_MCG_CAP[24] is 0, any uncorrected error condition (VAL=1 and UC=1), including those with the EN
flag cleared, is fatal and the handler must signal the operating system to reset the system. For errors that
do not generate machine check exceptions, the EN flag has no meaning.
• When the VAL flag is one, the UC flag is one, the EN flag is one and the PCC flag is zero in the IA32_MCi_STATUS
register, the error in this bank is an uncorrected recoverable (UCR) error. The MCE handler needs to examine
the S flag and the AR flag to find the type of the UCR error for software recovery and determine if software error
recovery is possible.
• When both the S and the AR flags are clear in the IA32_MCi_STATUS register for the UCR error (VAL=1, UC=1,
EN=x and PCC=0), the error in this bank is an uncorrected no-action required error (UCNA). UCNA errors are
uncorrected but do not require any OS recovery action to continue execution. These errors indicate that some
data in the system is corrupt, but that data has not been consumed and may not be consumed. If that data is
consumed a non-UCNA machine check exception will be generated. UCNA errors are signaled in the same way
as corrected machine check errors and the CMCI and CMC polling handler is primarily responsible for handling
UCNA errors. Like corrected errors, the MCA handler can optionally log and clear UCNA errors as long as it can
avoid the undesired race condition with the CMCI or CMC polling handler. As UCNA errors are not the source of
machine check exceptions, the MCA handler should continue searching for uncorrected or software recoverable
errors in all other MC banks.
• When the S flag in the IA32_MCi_STATUS register is set for the UCR error (VAL=1, UC=1, EN=1 and PCC=0),
the error in this bank is software recoverable and it was signaled through a machine-check exception. The AR
flag in the IA32_MCi_STATUS register further clarifies the type of the software recoverable errors.
• When the AR flag in the IA32_MCi_STATUS register is clear for the software recoverable error (VAL=1, UC=1,
EN=1, PCC=0 and S=1), the error in this bank is a software recoverable action optional (SRAO) error. The MCE
handler and the operating system can analyze the IA32_MCi_STATUS [15:0] to implement MCA error code
specific optional recovery action, but this recovery action is optional. System software can resume the program
execution from the instruction pointer saved on the stack for the machine check exception when the RIPV flag
in the IA32_MCG_STATUS register is set.
• Even if the OVER flag in the IA32_MCi_STATUS register is set for the SRAO error (VAL=1, UC=1, EN=1, PCC=0,
S=1 and AR=0), the MCE handler can take recovery action for the SRAO error logged in the IA32_MCi_STATUS
register. Since the recovery action for SRAO errors is optional, restarting the program execution from the
instruction pointer saved on the stack for the machine check exception is still possible for the overflowed SRAO
error if the RIPV flag in the IA32_MCG_STATUS is set.
• When the AR flag in the IA32_MCi_STATUS register is set for the software recoverable error (VAL=1, UC=1,
EN=1, PCC=0 and S=1), the error in this bank is a software recoverable action required (SRAR) error. The MCE
handler and the operating system must take recovery action in order to continue execution after the machine-
check exception. The MCA handler and the operating system need to analyze the IA32_MCi_STATUS [15:0] to
determine the MCA error code specific recovery action. If no recovery action can be performed, the operating
system must reset the system.
• When the OVER flag in the IA32_MCi_STATUS register is set for the SRAR error (VAL=1, UC=1, EN=1, PCC=0,
S=1 and AR=1), the MCE handler cannot take recovery action as the information of the SRAR error in the
IA32_MCi_STATUS register was potentially lost due to the overflow condition. Since the recovery action for
SRAR errors must be taken, the MCE handler must signal the operating system to reset the system.
• When the MCE handler cannot find any uncorrected (VAL=1, UC=1 and EN=1) or any software recoverable
errors (VAL=1, UC=1, EN=1, PCC=0 and S=1) in any of the IA32_MCi banks of the processor, this is an
unexpected condition for the MCE handler and the handler should signal the operating system to reset the
system.
• Before returning from the machine-check exception handler, software must clear the MCIP flag in the
IA32_MCG_STATUS register. The MCIP flag is used to detect recursion. The machine-check architecture does
not support recursion. When the processor receives a machine check when MCIP is set, it automatically enters
the shutdown state.
Example 15-4 gives pseudocode for an MC exception handler that supports recovery of UCR.
IF NOERROR = TRUE
THEN
IF NOT (MCG_RIPV = 1 AND MCG_EIPV = 0)
THEN
RESTARTABILITY = FALSE;
FI
FI;
IF RESTARTABILITY = FALSE
THEN
Report RESTARTABILITY to console;
Reset system;
FI;
IF MCA_BROADCAST = TRUE
THEN
IF ProcessorCount = MAX_PROCESSORS
AND NOERROR = TRUE
THEN
Report RESTARTABILITY to console;
Reset system;
FI;
Release SpinLock;
Wait till ProcessorCount = MAX_PROCESSORS on system;
(* implement a timeout and abort function if necessary *)
FI;
CLEAR IA32_MCG_STATUS;
RESUME Execution;
(* End of MACHINE CHECK HANDLER*)
MCA ERROR PROCESSING: (* MCA Error Processing Routine called from MCA Handler *)
IF MCIP flag in IA32_MCG_STATUS = 0
THEN (* MCIP=0 upon MCA is unexpected *)
RESTARTABILITY = FALSE;
FI;
16. Updates to Chapter 16, Volume 3B
Change bars and green text show changes to Chapter 16 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3B: System Programming Guide, Part 2.
------------------------------------------------------------------------------------------
Changes to this chapter: Added new Machine Check sections for future processors with CPUID
DisplayFamily_DisplayModel Signatures 06_6AH, 06_6CH, and 06_86H. Updated processor naming as necessary.
Encoding of the model-specific and other information fields is different across processor families. The differences
are documented in the following sections.
These errors are reported in the IA32_MCi_STATUS MSRs. They are reported architecturally as compound errors
with a general form of 0000 1PPT RRRR IILL in the MCA error code field. See Chapter 15 for information on the
interpretation of compound error codes. Incremental decoding information is listed in Table 16-2.
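As a sketch, the sub-fields of the bus/interconnect compound form can be extracted with shifts and masks; the extractor names below are illustrative, and the correction report filtering (F) bit (bit 12) is masked off before matching:

```c
#include <assert.h>
#include <stdint.h>

/* Sub-field extractors for bus/interconnect compound error codes of the
   form 0000 1PPT RRRR IILL. */
static int is_bus_err(uint16_t c)
{
    /* bits 15:11 must read 00001 once the F bit (bit 12) is masked */
    return ((c & ~(1u << 12)) & 0xF800) == 0x0800;
}
static unsigned bus_pp(uint16_t c)   { return (c >> 9) & 0x3; } /* participation */
static unsigned bus_t(uint16_t c)    { return (c >> 8) & 0x1; } /* time-out */
static unsigned bus_rrrr(uint16_t c) { return (c >> 4) & 0xF; } /* request type */
static unsigned bus_ii(uint16_t c)   { return (c >> 2) & 0x3; } /* memory or I/O */
static unsigned bus_ll(uint16_t c)   { return c & 0x3; }        /* level */
```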
Table 16-2. Incremental Decoding Information: Processor Family 06H Machine Error Codes For Machine Check
Type Bit No. Bit Function Bit Description
MCA error 15:0
codes1
Model specific 18:16 Reserved Reserved
errors
Model specific 24:19 Bus queue request 000000 for BQ_DCU_READ_TYPE error
errors type 000010 for BQ_IFU_DEMAND_TYPE error
000011 for BQ_IFU_DEMAND_NC_TYPE error
000100 for BQ_DCU_RFO_TYPE error
000101 for BQ_DCU_RFO_LOCK_TYPE error
000110 for BQ_DCU_ITOM_TYPE error
001000 for BQ_DCU_WB_TYPE error
001010 for BQ_DCU_WCEVICT_TYPE error
001011 for BQ_DCU_WCLINE_TYPE error
001100 for BQ_DCU_BTM_TYPE error
Vol. 3B 16-1
INTERPRETING MACHINE-CHECK ERROR CODES
Table 16-2. Incremental Decoding Information: Processor Family 06H Machine Error Codes For Machine Check
Type Bit No. Bit Function Bit Description
001101 for BQ_DCU_INTACK_TYPE error
001110 for BQ_DCU_INVALL2_TYPE error
001111 for BQ_DCU_FLUSHL2_TYPE error
010000 for BQ_DCU_PART_RD_TYPE error
010010 for BQ_DCU_PART_WR_TYPE error
010100 for BQ_DCU_SPEC_CYC_TYPE error
011000 for BQ_DCU_IO_RD_TYPE error
011001 for BQ_DCU_IO_WR_TYPE error
011100 for BQ_DCU_LOCK_RD_TYPE error
011110 for BQ_DCU_SPLOCK_RD_TYPE error
011101 for BQ_DCU_LOCK_WR_TYPE error
Model specific 27:25 Bus queue error type 000 for BQ_ERR_HARD_TYPE error
errors 001 for BQ_ERR_DOUBLE_TYPE error
010 for BQ_ERR_AERR2_TYPE error
100 for BQ_ERR_SINGLE_TYPE error
101 for BQ_ERR_AERR1_TYPE error
Model specific 28 FRC error 1 if FRC error active
errors
29 BERR 1 if BERR is driven
30 Internal BINIT 1 if BINIT driven for this processor
31 Reserved Reserved
Other 34:32 Reserved Reserved
information
35 External BINIT 1 if BINIT is received from external bus.
36 Response parity error This bit is asserted in IA32_MCi_STATUS if this component has received a parity
error on the RS[2:0]# pins for a response transaction. The RS signals are checked
by the RSP# external pin.
37 Bus BINIT This bit is asserted in IA32_MCi_STATUS if this component has received a hard
error response on a split transaction (one access that needed to be split across
the 64-bit external bus interface into two accesses).
38 Timeout BINIT This bit is asserted in IA32_MCi_STATUS if this component has experienced a ROB
time-out, which indicates that no micro-instruction has been retired for a
predetermined period of time.
A ROB time-out occurs when the 15-bit ROB time-out counter carries a 1 out of its
high order bit.2 The timer is cleared when a micro-instruction retires, an exception
is detected by the core processor, RESET is asserted, or when a ROB BINIT occurs.
The ROB time-out counter is prescaled by the 8-bit PIC timer, which is a divide-by-128
of the bus clock (the bus clock is 1:2, 1:3, or 1:4 of the core clock). When a carry
out of the 8-bit PIC timer occurs, the ROB counter counts up by one. While this bit
is asserted, it cannot be overwritten by another error.
41:39 Reserved Reserved
42 Hard error This bit is asserted in IA32_MCi_STATUS if this component has initiated a bus
transaction that has received a hard error response. While this bit is asserted, it
cannot be overwritten.
Table 16-2. Incremental Decoding Information: Processor Family 06H Machine Error Codes For Machine Check
Type Bit No. Bit Function Bit Description
43 IERR This bit is asserted in IA32_MCi_STATUS if this component has experienced a
failure that causes the IERR pin to be asserted. While this bit is asserted, it cannot
be overwritten.
44 AERR This bit is asserted in IA32_MCi_STATUS if this component has initiated two failing
bus transactions which have failed due to Address Parity Errors (AERR asserted).
While this bit is asserted, it cannot be overwritten.
45 UECC The Uncorrectable ECC error bit is asserted in IA32_MCi_STATUS for uncorrected
ECC errors. While this bit is asserted, the ECC syndrome field will not be
overwritten.
46 CECC The correctable ECC error bit is asserted in IA32_MCi_STATUS for corrected ECC
errors.
54:47 ECC syndrome The ECC syndrome field in IA32_MCi_STATUS contains the 8-bit ECC syndrome only
if the error was a correctable/uncorrectable ECC error and there wasn't a previous
valid ECC error syndrome logged in IA32_MCi_STATUS.
A previous valid ECC error in IA32_MCi_STATUS is indicated by
IA32_MCi_STATUS.bit45 (uncorrectable error occurred) being asserted. After
processing an ECC error, machine-check handling software should clear
IA32_MCi_STATUS.bit45 so that future ECC error syndromes can be logged.
56:55 Reserved Reserved.
Status register 63:57
validity
indicators1
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
2. For processors with a CPUID signature of 06_0EH, a ROB time-out occurs when the 23-bit ROB time-out counter carries a 1 out of its
high order bit.
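Assuming the model-specific bit positions in Table 16-2, the ECC status and syndrome fields can be pulled out of a raw IA32_MCi_STATUS value as follows (a sketch; these bits are model specific and apply only to the processors this table covers):

```c
#include <assert.h>
#include <stdint.h>

/* Model-specific ECC fields of IA32_MCi_STATUS from Table 16-2. */
#define MCI_UECC(s)     ((unsigned)(((s) >> 45) & 1))    /* uncorrectable ECC */
#define MCI_CECC(s)     ((unsigned)(((s) >> 46) & 1))    /* correctable ECC */
#define MCI_SYNDROME(s) ((unsigned)(((s) >> 47) & 0xFF)) /* bits 54:47 */
```

After processing an ECC error, machine-check handling software clears bit 45 so that future ECC error syndromes can be logged, as noted in the table.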
Table 16-3. CPUID DisplayFamily_DisplayModel Signatures for Processors Based on Intel Core Microarchitecture
DisplayFamily_DisplayModel Processor Families/Processor Number Series
06_1DH Intel Xeon Processor 7400 series.
06_17H Intel Xeon Processor 5200, 5400 series, Intel Core 2 Quad processor Q9650.
06_0FH Intel Xeon Processor 3000, 3200, 5100, 5300, 7300 series, Intel Core 2 Quad, Intel Core 2 Extreme,
Intel Core 2 Duo processors, Intel Pentium dual-core processors.
Table 16-4. Incremental Bus Error Codes of Machine Check for Processors
Based on Intel Core Microarchitecture
Type Bit No. Bit Function Bit Description
MCA error 15:0
codes1
Model specific 18:16 Reserved Reserved
errors
Model specific 24:19 Bus queue request ‘000001 for BQ_PREF_READ_TYPE error
errors type 000000 for BQ_DCU_READ_TYPE error
000010 for BQ_IFU_DEMAND_TYPE error
000011 for BQ_IFU_DEMAND_NC_TYPE error
000100 for BQ_DCU_RFO_TYPE error
000101 for BQ_DCU_RFO_LOCK_TYPE error
000110 for BQ_DCU_ITOM_TYPE error
001000 for BQ_DCU_WB_TYPE error
001010 for BQ_DCU_WCEVICT_TYPE error
001011 for BQ_DCU_WCLINE_TYPE error
001100 for BQ_DCU_BTM_TYPE error
001101 for BQ_DCU_INTACK_TYPE error
001110 for BQ_DCU_INVALL2_TYPE error
001111 for BQ_DCU_FLUSHL2_TYPE error
010000 for BQ_DCU_PART_RD_TYPE error
010010 for BQ_DCU_PART_WR_TYPE error
010100 for BQ_DCU_SPEC_CYC_TYPE error
011000 for BQ_DCU_IO_RD_TYPE error
011001 for BQ_DCU_IO_WR_TYPE error
011100 for BQ_DCU_LOCK_RD_TYPE error
011110 for BQ_DCU_SPLOCK_RD_TYPE error
011101 for BQ_DCU_LOCK_WR_TYPE error
100100 for BQ_L2_WI_RFO_TYPE error
100110 for BQ_L2_WI_ITOM_TYPE error
Model specific 27:25 Bus queue error type ‘001 for Address Parity Error
errors ‘010 for Response Hard Error
‘011 for Response Parity Error
Model specific 28 MCE Driven 1 if MCE is driven
errors
29 MCE Observed 1 if MCE is observed
30 Internal BINIT 1 if BINIT driven for this processor
31 BINIT Observed 1 if BINIT is observed for this processor
Other 33:32 Reserved Reserved
information
34 PIC and FSB data Data Parity detected on either PIC or FSB access
parity
35 Reserved Reserved
Table 16-4. Incremental Bus Error Codes of Machine Check for Processors
Based on Intel Core Microarchitecture (Contd.)
Type Bit No. Bit Function Bit Description
36 Response parity error This bit is asserted in IA32_MCi_STATUS if this component has received a parity
error on the RS[2:0]# pins for a response transaction. The RS signals are checked
by the RSP# external pin.
37 FSB address parity Address parity error detected:
1 = Address parity error detected
0 = No address parity error
38 Timeout BINIT This bit is asserted in IA32_MCi_STATUS if this component has experienced a ROB
time-out, which indicates that no micro-instruction has been retired for a
predetermined period of time.
A ROB time-out occurs when the 23-bit ROB time-out counter carries a 1 out of its
high order bit. The timer is cleared when a micro-instruction retires, an exception is
detected by the core processor, RESET is asserted, or when a ROB BINIT occurs.
The ROB time-out counter is prescaled by the 8-bit PIC timer, which is a divide-by-
128 of the bus clock (the bus clock is 1:2, 1:3, or 1:4 of the core clock). When a carry
out of the 8-bit PIC timer occurs, the ROB counter counts up by one. While this bit
is asserted, it cannot be overwritten by another error.
41:39 Reserved Reserved
42 Hard error This bit is asserted in IA32_MCi_STATUS if this component has initiated a bus
transaction that has received a hard error response. While this bit is asserted, it
cannot be overwritten.
43 IERR This bit is asserted in IA32_MCi_STATUS if this component has experienced a
failure that causes the IERR pin to be asserted. While this bit is asserted, it cannot
be overwritten.
44 Reserved Reserved
45 Reserved Reserved
46 Reserved Reserved
54:47 Reserved Reserved
56:55 Reserved Reserved.
Status register 63:57
validity
indicators1
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
16.2.1 Model-Specific Machine Check Error Codes for Intel Xeon Processor 7400 Series
The Intel Xeon processor 7400 series has machine check register banks that generally follow the description in
Chapter 15 and Section 16.2. Additional error codes specific to the Intel Xeon processor 7400 series are described
in this section.
MC4_STATUS[63:0] is the main error-logging register for the processor’s L3 and front side bus errors on the Intel
Xeon processor 7400 series. It supports the L3 Errors and the Bus and Interconnect Errors compound error codes
in the MCA error code field.
Table 16-5. Incremental MCA Error Code Types for Intel Xeon Processor 7400
Processor MCA_Error_Code (MC4_STATUS[15:0])
Type Error Code Binary Encoding Meaning
C Internal Error 0000 0100 0000 0000 Internal Error Type Code
B Bus and 0000 100x 0000 1111 Not used but this encoding is reserved for compatibility with other MCA
Interconnect implementations
Error 0000 101x 0000 1111 Not used but this encoding is reserved for compatibility with other MCA
implementations
0000 110x 0000 1111 Not used but this encoding is reserved for compatibility with other MCA
implementations
0000 1110 0000 1111 Bus and Interconnection Error Type Code
0000 1111 0000 1111 Not used but this encoding is reserved for compatibility with other MCA
implementations
The boldfaced binary encodings are the only encodings used by the processor for MC4_STATUS[15:0].
16.2.2 Intel Xeon Processor 7400 Model Specific Error Code Field
Note: The model specific error code field is in MC4_STATUS (bits 31:16).
Table 16-8. Intel QPI Machine Check Error Codes for IA32_MC0_STATUS and IA32_MC1_STATUS
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
Table 16-9. Intel QPI Machine Check Error Codes for IA32_MC0_MISC and IA32_MC1_MISC
Type Bit No. Bit Function Bit Description
Model specific errors1
7:0 QPI Opcode Message class and opcode from the packet with the error
13:8 RTId QPI Request Transaction ID
15:14 Reserved Reserved
18:16 RHNID QPI Requestor/Home Node ID
23:19 Reserved Reserved
24 IIB QPI Interleave/Head Indication Bit
NOTES:
1. Which of these fields are valid depends on the error type.
Table 16-11. Incremental Memory Controller Error Codes of Machine Check for IA32_MC8_STATUS
Type Bit No. Bit Function Bit Description
MCA error codes1 15:0 MCACOD Memory error format: 1MMMCCCC
Model specific errors
16 Read ECC error If 1, an ECC error occurred on a read
17 RAS ECC error If 1, an ECC error occurred on a scrub
18 Write parity error If 1, bad parity on a write
19 Redundancy loss If 1, error in half of redundant memory
20 Reserved Reserved
21 Memory range error If 1, memory access out of range
22 RTID out of range If 1, internal ID invalid
23 Address parity error If 1, bad address parity
24 Byte enable parity error If 1, bad byte enable parity
Other information 37:25 Reserved Reserved
52:38 CORE_ERR_CNT Corrected error count
56:53 Reserved Reserved
Status register validity 63:57
indicators1
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
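Assuming the layout in Table 16-11, the memory MCACOD sub-fields and the corrected error count can be extracted as sketched below:

```c
#include <assert.h>
#include <stdint.h>

/* Field extractors for IA32_MC8_STATUS per Table 16-11. */
#define MC8_MCACOD(s)       ((unsigned)((s) & 0xFFFF))         /* bits 15:0 */
#define MC8_CORR_ERR_CNT(s) ((unsigned)(((s) >> 38) & 0x7FFF)) /* bits 52:38 */

/* Memory MCACOD format 1MMMCCCC: MMM = memory transaction type,
   CCCC = channel. */
#define MEM_MCACOD_MMM(c)  (((c) >> 4) & 0x7)
#define MEM_MCACOD_CCCC(c) ((c) & 0xF)
```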
Table 16-12. Incremental Memory Controller Error Codes of Machine Check for IA32_MC8_MISC
Type Bit No. Bit Function Bit Description
Model specific errors1
7:0 RTId Transaction Tracker ID
15:8 Reserved Reserved
17:16 DIMM DIMM ID which got the error
19:18 Channel Channel ID which got the error
31:20 Reserved Reserved
63:32 Syndrome ECC Syndrome
NOTES:
1. Which of these fields are valid depends on the error type.
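Similarly, the fields of IA32_MC8_MISC in Table 16-12 map to simple shifts and masks (a sketch; as noted above, which fields are valid depends on the error type):

```c
#include <assert.h>
#include <stdint.h>

/* Field extractors for IA32_MC8_MISC per Table 16-12. */
#define MC8_MISC_RTID(m)     ((unsigned)((m) & 0xFF))        /* bits 7:0 */
#define MC8_MISC_DIMM(m)     ((unsigned)(((m) >> 16) & 0x3)) /* bits 17:16 */
#define MC8_MISC_CHANNEL(m)  ((unsigned)(((m) >> 18) & 0x3)) /* bits 19:18 */
#define MC8_MISC_SYNDROME(m) ((uint32_t)((m) >> 32))         /* bits 63:32 */
```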
Table 16-14. Intel QPI MC Error Codes for IA32_MC6_STATUS and IA32_MC7_STATUS
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
tion logging of the IMC. The additional error information logged by the IMC is stored in IA32_MCi_STATUS and
IA32_MCi_MISC (i = 8, 11).
Table 16-15. Intel IMC MC Error Codes for IA32_MCi_STATUS (i= 8, 11)
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
Table 16-16. Intel IMC MC Error Codes for IA32_MCi_MISC (i= 8, 11)
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
Table 16-18. Intel IMC MC Error Codes for IA32_MCi_STATUS (i= 9-16)
37 Reserved Reserved
56:38 See Chapter 15, “Machine-Check Architecture,”
Status register validity indicators1   63:57
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
Table 16-19. Intel IMC MC Error Codes for IA32_MCi_MISC (i= 9-16)
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
Table 16-21. Intel QPI MC Error Codes for IA32_MCi_STATUS (i = 5, 20, 21)
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
Table 16-22. Intel IMC MC Error Codes for IA32_MCi_STATUS (i= 9-16)
37 Reserved Reserved
56:38 See Chapter 15, “Machine-Check Architecture,”
Status register validity indicators1   63:57
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
Table 16-23. Intel IMC MC Error Codes for IA32_MCi_MISC (i= 9-16)
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
Table 16-25. Intel IMC MC Error Codes for IA32_MCi_STATUS (i= 9-10)
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
Table 16-26. Intel IMC MC Error Codes for IA32_MCi_STATUS (i= 9-16)
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
Table 16-30. Intel IMC MC Error Codes for IA32_MCi_STATUS (i= 13-18)
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
NOTE
The interconnect machine check errors in this section apply only to future Intel Xeon processors
with a CPUID DisplayFamily_DisplayModel signature of 06_6AH. They do not apply to future Intel
Xeon processors with a CPUID DisplayFamily_DisplayModel signature of 06_6CH.
Table 16-35 lists model-specific fields to interpret error codes applicable to IA32_MCi_STATUS, i= 5, 7, 8.
Table 16-35. Interconnect MC Error Codes for IA32_MCi_STATUS, i = 5, 7, 8
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
Table 16-36. Intel IMC MC Error Codes for IA32_MCi_STATUS (i= 13-15, 17-19, 21-23, 25-27)
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
The supported error codes follow the architectural MCACOD definition type 1MMMCCCC (see Chapter 15, “Machine-Check Architecture”).
Table 16-37. M2M MC Error Codes for IA32_MCi_STATUS (i= 12, 16, 20, 24)
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
MC error codes associated with mirrored memory corrections are reported in the MSRs IA32_MC12_MISC,
IA32_MC16_MISC, IA32_MC20_MISC, and IA32_MC24_MISC. The model-specific error codes listed in Table 16-32
also apply to IA32_MCi_MISC, i = 12, 16, 20, 24.
Table 16-38. Incremental Decoding Information: Processor Family 0FH Machine Error Codes For Machine Check

Type                        Bit No.  Bit Function             Bit Description
MCA error codes1            15:0
Model-specific error codes  16       FSB address parity       Address parity error detected:
                                                              1 = Address parity error detected
                                                              0 = No address parity error
                            17       Response hard fail       Hardware failure detected on response
                            18       Response parity          Parity error detected on response
                            19       PIC and FSB data parity  Data parity detected on either PIC or FSB access
                            20       Processor Signature = 00000F04H: Invalid PIC request
                                                              Indicates an error due to an invalid PIC request (access was made to PIC space with WB memory):
                                                              1 = Invalid PIC request error
                                                              0 = No invalid PIC request error
                                     Reserved                 Reserved
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
Table 16-10 provides information on interpreting additional family 0FH model-specific fields for cache hierarchy
errors. These errors are reported in one of the IA32_MCi_STATUS MSRs. These errors are reported, architecturally,
as compound errors with a general form of 0000 0001 RRRR TTLL in the MCA error code field. See Chapter 15 for
how to interpret the compound error code.
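For illustration, the sub-fields of such a compound error code can be separated numerically as in this sketch (the helper name is mine; see Chapter 15 for the meaning of each RRRR, TT, and LL encoding):

```python
def decode_cache_mcacod(mcacod):
    """Split a cache-hierarchy compound MCA error code 0000 0001 RRRR TTLL.

    Returns the raw (request, transaction type, level) sub-field values."""
    assert (mcacod >> 8) == 0x01, "not a cache hierarchy compound error code"
    ll   = mcacod & 0x3          # LL: cache level, bits 1:0
    tt   = (mcacod >> 2) & 0x3   # TT: transaction type, bits 3:2
    rrrr = (mcacod >> 4) & 0xF   # RRRR: request type, bits 7:4
    return rrrr, tt, ll
```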
16.13.1 Model-Specific Machine Check Error Codes for Intel Xeon Processor MP 7100 Series
Intel Xeon processor MP 7100 series has five register banks which contain information related to machine-check
errors. MCi_STATUS[63:0] refers to all five register banks. MC0_STATUS[63:0] through MC3_STATUS[63:0] are the
same as on previous generations of Intel Xeon processors within family 0FH. MC4_STATUS[63:0] is the main error-
logging register for the processor’s L3 and front side bus errors. It supports the L3 Errors and Bus and Interconnect
Errors Compound Error Codes in the MCA Error Code Field.
Table 16-40. Incremental MCA Error Code for Intel Xeon Processor MP 7100

      Processor MCA_Error_Code (MC4_STATUS[15:0])
Type  Error Code                  Binary Encoding      Meaning
C     Internal Error              0000 0100 0000 0000  Internal Error Type Code
A     L3 Tag Error                0000 0001 0000 1011  L3 Tag Error Type Code
B     Bus and Interconnect Error  0000 100x 0000 1111  Not used, but this encoding is reserved for compatibility with other MCA implementations
                                  0000 101x 0000 1111  Not used, but this encoding is reserved for compatibility with other MCA implementations
                                  0000 110x 0000 1111  Not used, but this encoding is reserved for compatibility with other MCA implementations
                                  0000 1110 0000 1111  Bus and Interconnection Error Type Code
                                  0000 1111 0000 1111  Not used, but this encoding is reserved for compatibility with other MCA implementations

The boldfaced binary encodings are the only encodings used by the processor for MC4_STATUS[15:0].
Note: The Model Specific Error Code field in MC4_STATUS (bits 31:16).
16.13.3.2 Processor Model Specific Error Code Field Type B: Bus and Interconnect Error
Note: The Model Specific Error Code field in MC4_STATUS (bits 31:16).
Exactly one of the bits defined in the preceding table will be set for a Bus and Interconnect Error. The Data ECC can
be correctable or uncorrectable (the MC4_STATUS.UC bit distinguishes between the correctable and uncorrectable
cases, with the Other_Info field possibly providing the ECC syndrome for correctable errors). All other errors for
this processor MCA error type are uncorrectable.
16.13.3.3 Processor Model Specific Error Code Field Type C: Cache Bus Controller Error
All errors in this table, except for the correctable ECC types, are uncorrectable. The correctable ECC events may
supply the ECC syndrome in the Other_Info field of the MC4_STATUS MSR.
Table 16-45. Decoding Family 0FH Machine Check Codes for Cache Hierarchy Errors

Type                        Bit No.  Bit Function         Bit Description
MCA error codes1            15:0
Model specific error codes  17:16    Tag Error Code       Contains the tag error code for this machine check error:
                                                          00 = No error detected
                                                          01 = Parity error on tag miss with a clean line
                                                          10 = Parity error/multiple tag match on tag hit
                                                          11 = Parity error/multiple tag match on tag miss
                            19:18    Data Error Code      Contains the data error code for this machine check error:
                                                          00 = No error detected
                                                          01 = Single bit error
                                                          10 = Double bit error on a clean line
                                                          11 = Double bit error on a modified line
                            20       L3 Error             This bit is set if the machine check error originated in the L3 (it can be ignored for invalid PIC request errors):
                                                          1 = L3 error
                                                          0 = L2 error
                            21       Invalid PIC Request  Indicates an error due to an invalid PIC request (access was made to PIC space with WB memory):
                                                          1 = Invalid PIC request error
                                                          0 = No invalid PIC request error
                            31:22    Reserved             Reserved
Other Information           39:32    8-bit Error Count    Holds a count of the number of errors since reset. The counter begins at 0 for the first error and saturates at a count of 255.
NOTES:
1. These fields are architecturally defined. Refer to Chapter 15, “Machine-Check Architecture,” for more information.
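Given a raw status value from one of these banks, the model-specific fields of Table 16-45 decode as in this sketch (helper and dictionary names are illustrative):

```python
# Two-bit sub-field encodings from Table 16-45.
TAG_ERR = {0: "No error detected",
           1: "Parity error on tag miss with a clean line",
           2: "Parity error/multiple tag match on tag hit",
           3: "Parity error/multiple tag match on tag miss"}
DATA_ERR = {0: "No error detected",
            1: "Single bit error",
            2: "Double bit error on a clean line",
            3: "Double bit error on a modified line"}

def decode_0fh_cache_status(status):
    """Decode family 0FH cache hierarchy model-specific fields."""
    return {
        "tag_error":  TAG_ERR[(status >> 16) & 0x3],    # bits 17:16
        "data_error": DATA_ERR[(status >> 18) & 0x3],   # bits 19:18
        "l3_error":   bool((status >> 20) & 1),         # bit 20: L3 vs. L2
        "invalid_pic_request": bool((status >> 21) & 1),
        "error_count": (status >> 32) & 0xFF,           # saturating, bits 39:32
    }
```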
17. Updates to Chapter 18, Volume 3B
Change bars and green text show changes to Chapter 18 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3B: System Programming Guide, Part 2.
------------------------------------------------------------------------------------------
Changes to this chapter: Updated Section 18.8, “IA32_PERF_CAPABILITIES MSR Enumeration,” with new information about IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14].
Intel 64 and IA-32 architectures provide facilities for monitoring performance via a PMU (Performance Monitoring
Unit).
Vol. 3B 18-1
PERFORMANCE MONITORING
• Section 18.3.8, “6th Generation, 7th Generation and 8th Generation Intel® Core™ Processor
Performance Monitoring Facility”
• Section 18.3.9, “Next Generation Intel® Core™ Processor Performance Monitoring Facility”
— Section 18.4, “Performance monitoring (Intel® Xeon™ Phi Processors)”
• Section 18.4.1, “Intel® Xeon Phi™ Processor 7200/5200/3200 Performance Monitoring”
— Section 18.5, “Performance Monitoring (Intel® Atom™ Processors)”
• Section 18.5.1, “Performance Monitoring (45 nm and 32 nm Intel® Atom™ Processors)”
• Section 18.5.2, “Performance Monitoring for Silvermont Microarchitecture”
• Section 18.5.3, “Performance Monitoring for Goldmont Microarchitecture”
• Section 18.5.4, “Performance Monitoring for Goldmont Plus Microarchitecture”
• Section 18.5.5, “Performance Monitoring for Tremont Microarchitecture”
— Section 18.6, “Performance Monitoring (Legacy Intel Processors)”
• Section 18.6.1, “Performance Monitoring (Intel® Core™ Solo and Intel® Core™ Duo Processors)”
• Section 18.6.2, “Performance Monitoring (Processors Based on Intel® Core™ Microarchitecture)”
• Section 18.6.3, “Performance Monitoring (Processors Based on Intel NetBurst® Microarchitecture)”
• Section 18.6.4, “Performance Monitoring and Intel Hyper-Threading Technology in Processors Based on
Intel NetBurst® Microarchitecture”
• Section 18.6.4.5, “Counting Clocks on systems with Intel Hyper-Threading Technology in
Processors Based on Intel NetBurst® Microarchitecture”
• Section 18.6.5, “Performance Monitoring and Dual-Core Technology”
• Section 18.6.6, “Performance Monitoring on 64-bit Intel Xeon Processor MP with Up to 8-MByte L3
Cache”
• Section 18.6.7, “Performance Monitoring on L3 and Caching Bus Controller Sub-Systems”
• Section 18.6.8, “Performance Monitoring (P6 Family Processor)”
• Section 18.6.9, “Performance Monitoring (Pentium Processors)”
— Section 18.7, “Counting Clocks”
— Section 18.8, “IA32_PERF_CAPABILITIES MSR Enumeration”
— Section 18.9, “PEBS Facility”
Intel Core processors and related Intel Xeon processor families based on the Nehalem through Broadwell microarchitectures support version IDs 1, 2, and 3. Intel processors based on the Skylake, Kaby Lake and Coffee Lake
microarchitectures support version ID 4.
Next generation Intel Atom processors are based on the Goldmont microarchitecture. Intel processors based on
the Goldmont microarchitecture support version ID 4.
[Figure: Layout of IA32_PERFEVTSELx MSRs — Event Select (bits 7:0), Unit Mask/UMASK (bits 15:8), USR (bit 16), OS (bit 17), E (bit 18), PC (bit 19), INT (bit 20), EN (bit 22), INV (bit 23), Counter Mask/CMASK (bits 31:24); bits 63:32 reserved.]
• Unit mask (UMASK) field (bits 8 through 15) — These bits qualify the condition that the selected event
logic unit detects. Valid UMASK values for each event logic unit are specific to the unit. For each architectural
performance event, its corresponding UMASK value defines a specific microarchitectural condition.
A pre-defined microarchitectural condition associated with an architectural event may not be applicable to a
given processor. The processor then reports only a subset of pre-defined architectural events. Pre-defined
architectural events are listed in Table 18-1; support for pre-defined architectural events is enumerated using
CPUID.0AH:EBX. Architectural performance events available in the initial implementation are listed in Table
19-1.
• USR (user mode) flag (bit 16) — Specifies that the selected microarchitectural condition is counted when
the logical processor is operating at privilege levels 1, 2 or 3. This flag can be used with the OS flag.
• OS (operating system mode) flag (bit 17) — Specifies that the selected microarchitectural condition is
counted when the logical processor is operating at privilege level 0. This flag can be used with the USR flag.
• E (edge detect) flag (bit 18) — Enables (when set) edge detection of the selected microarchitectural
condition. The logical processor counts the number of deasserted-to-asserted transitions for any condition that
can be expressed by the other fields. The mechanism does not permit back-to-back assertions to be distinguished.
This mechanism allows software to measure not only the fraction of time spent in a particular state, but also the
average length of time spent in such a state (for example, the time spent waiting for an interrupt to be
serviced).
• PC (pin control) flag (bit 19) — Beginning with Sandy Bridge microarchitecture, this bit is reserved (not
writeable). On processors based on previous microarchitectures, when this flag is set, the logical processor
toggles the PMi pins and increments the counter when performance-monitoring events occur; when clear, the
processor toggles the PMi pins when the counter overflows. The toggling of a pin is defined as assertion of the
pin for a single bus clock followed by deassertion.
• INT (APIC interrupt enable) flag (bit 20) — When set, the logical processor generates an exception
through its local APIC on counter overflow.
• EN (Enable Counters) Flag (bit 22) — When set, performance counting is enabled in the corresponding
performance-monitoring counter; when clear, the corresponding counter is disabled. The event logic unit for a
UMASK must be disabled by setting IA32_PERFEVTSELx[bit 22] = 0, before writing to IA32_PMCx.
• INV (invert) flag (bit 23) — When set, inverts the counter-mask (CMASK) comparison, so that both greater
than or equal to and less than comparisons can be made (0: greater than or equal; 1: less than). Note that if the
counter-mask is programmed to zero, the INV flag is ignored.
• Counter mask (CMASK) field (bits 24 through 31) — When this field is not zero, a logical processor
compares this mask to the event count of the detected microarchitectural condition during a single cycle. If
the event count is greater than or equal to this mask, the counter is incremented by one. Otherwise the counter
is not incremented.
This mask is intended for software to characterize microarchitectural conditions that can count multiple
occurrences per cycle (for example, two or more instructions retired per clock; or bus queue occupations). If
the counter-mask field is 0, then the counter is incremented each cycle by the event count associated with
multiple occurrences.
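As a sketch of how the fields described above combine, the following hypothetical helper assembles a 64-bit IA32_PERFEVTSELx value (the PC bit 19 and the version-3 AnyThread bit 21 are deliberately left zero here; the function name and defaults are mine):

```python
def perfevtsel(event, umask, usr=True, os=True, edge=False,
               intr=False, en=True, inv=False, cmask=0):
    """Assemble an IA32_PERFEVTSELx value from its documented fields."""
    val = (event & 0xFF) | ((umask & 0xFF) << 8)   # Event Select 7:0, UMASK 15:8
    val |= (usr << 16) | (os << 17) | (edge << 18) # USR, OS, E
    val |= (intr << 20) | (en << 22) | (inv << 23) # INT, EN, INV
    val |= (cmask & 0xFF) << 24                    # CMASK 31:24
    return val
```

For example, the UnHalted Core Cycles architectural event (event 3CH, umask 00H) counted at all privilege levels encodes to 0x43003C.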
Table 18-1. UMask and Event Select Encodings for Pre-Defined Architectural Performance Events

Bit Position (CPUID.0AH:EBX)  Event Name                  UMask  Event Select
0                             UnHalted Core Cycles        00H    3CH
1                             Instruction Retired         00H    C0H
2                             UnHalted Reference Cycles   01H    3CH
3                             LLC Reference               4FH    2EH
4                             LLC Misses                  41H    2EH
5                             Branch Instruction Retired  00H    C4H
6                             Branch Misses Retired       00H    C5H
7                             Topdown Slots               01H    A4H
A processor that supports architectural performance monitoring may not support all the predefined architectural
performance events (Table 18-1). The number of architectural events is reported through CPUID.0AH:EAX[31:24],
while non-zero bits in CPUID.0AH:EBX indicate any architectural events that are not available.
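Given raw CPUID.0AH register values obtained elsewhere (e.g., via a cpuid instruction wrapper), the enumeration rule above can be sketched as follows; the helper name is illustrative:

```python
# Architectural events in CPUID.0AH:EBX bit-position order (Table 18-1).
ARCH_EVENTS = ["UnHalted Core Cycles", "Instruction Retired",
               "UnHalted Reference Cycles", "LLC Reference", "LLC Misses",
               "Branch Instruction Retired", "Branch Misses Retired",
               "Topdown Slots"]

def supported_arch_events(eax, ebx):
    """List available architectural events from raw CPUID.0AH EAX/EBX.

    EAX[31:24] gives the number of events enumerated; a set bit i in EBX
    means the corresponding event is NOT available."""
    n = (eax >> 24) & 0xFF
    return [name for i, name in enumerate(ARCH_EVENTS[:n])
            if not (ebx >> i) & 1]
```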
The behavior of each architectural performance event is expected to be consistent on all processors that support
that event. Minor variations between microarchitectures are noted below:
• UnHalted Core Cycles — Event select 3CH, Umask 00H
This event counts core clock cycles when the clock signal on a specific core is running (not halted). The counter
does not advance in the following conditions:
— an ACPI C-state other than C0 for normal operation
— HLT
— STPCLK# pin asserted
— being throttled by TM1
— during the frequency switching phase of a performance state transition (see Chapter 14, “Power and
Thermal Management”)
The performance counter for this event counts across performance state transitions using different core clock
frequencies.
• Instructions Retired — Event select C0H, Umask 00H
This event counts the number of instructions at retirement. For instructions that consist of multiple micro-ops,
this event counts the retirement of the last micro-op of the instruction. An instruction with a REP prefix counts
as one instruction (not per iteration). Faults before the retirement of the last micro-op of a multi-op instruction
are not counted.
This event does not increment under VM-exit conditions. Counters continue counting during hardware
interrupts, traps, and inside interrupt handlers.
• UnHalted Reference Cycles — Event select 3CH, Umask 01H
This event counts reference clock cycles at a fixed frequency while the clock signal on the core is running. The
event counts at a fixed frequency, irrespective of core frequency changes due to performance state transitions.
Processors may implement this behavior differently. Current implementations use the core crystal clock, TSC or
the bus clock. Because the rate may differ between implementations, software should calibrate it to a time
source with known frequency.
• Last Level Cache References — Event select 2EH, Umask 4FH
This event counts requests originating from the core that reference a cache line in the last level on-die cache.
The event count includes speculation and cache line fills due to the first-level cache hardware prefetcher, but
may exclude cache line fills due to other hardware-prefetchers.
Because cache hierarchy, cache sizes, and other implementation-specific characteristics vary, comparing values
across processors to estimate performance differences is not recommended.
• Last Level Cache Misses — Event select 2EH, Umask 41H
This event counts each cache miss condition for references to the last level on-die cache. The event count may
include speculation and cache line fills due to the first-level cache hardware prefetcher, but may exclude cache
line fills due to other hardware-prefetchers.
Because cache hierarchy, cache sizes, and other implementation-specific characteristics vary, comparing values
across processors to estimate performance differences is not recommended.
• Branch Instructions Retired — Event select C4H, Umask 00H
This event counts branch instructions at retirement. It counts the retirement of the last micro-op of a branch
instruction.
• All Branch Mispredict Retired — Event select C5H, Umask 00H
This event counts mispredicted branch instructions at retirement. It counts the retirement of the last micro-op
of a branch instruction that was in the architectural path of execution and experienced a misprediction in the
branch prediction hardware.
Branch prediction hardware is implementation-specific across microarchitectures; value comparison to
estimate performance differences is not recommended.
• Topdown Slots — Event select A4H, Umask 01H
This event counts the total number of available slots for an unhalted logical processor.
The event increments by the machine width of the narrowest pipeline as employed by the Top-down Microarchitecture Analysis method. In processors that support Intel Hyper-Threading Technology, the count is distributed
among the unhalted logical processors (hyper-threads) that share the same physical core.
Software can use this event as the denominator for the top-level metrics of the Top-down Microarchitecture
Analysis method.
NOTE
Programming decisions or software provisions for functionality should not be based on the event
values or depend on the existence of performance monitoring events.
NOTE
Early generations of processors based on Intel Core microarchitecture may report support for
version 2 in CPUID.0AH:EDX but indicate incorrect information about version 2 facilities.
The IA32_FIXED_CTR_CTRL MSR includes multiple 4-bit fields; each 4-bit field controls the operation of one
fixed-function performance counter. Figure 18-2 shows the layout of the 4-bit controls for each fixed-function PMC.
Two sub-fields are currently defined within each control. The definitions of the bit fields are:
[Figure 18-2. Layout of IA32_FIXED_CTR_CTRL MSR — one 4-bit control field per fixed-function counter, each containing EN (enable) and PMI bits; bits 63:12 reserved.]
• Enable field (lowest 2 bits within each 4-bit control) — When bit 0 is set, the corresponding fixed-function
performance counter increments while the target condition associated with the architectural performance
event occurs at ring 0. When bit 1 is set, the counter increments while the target condition occurs at ring levels
greater than 0. Writing 0 to both bits stops the performance counter. Writing a value of 11B enables the counter
to increment irrespective of privilege level.
• PMI field (the fourth bit within each 4-bit control) — When set, the logical processor generates an
exception through its local APIC on an overflow condition of the respective fixed-function counter.
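The 4-bit-per-counter packing can be illustrated with a hypothetical helper (the name and argument convention are mine):

```python
def fixed_ctr_ctrl(controls):
    """Pack IA32_FIXED_CTR_CTRL from (os_enable, usr_enable, pmi) tuples.

    controls[N] configures fixed-function counter N; each counter owns a
    4-bit field at bit offset 4*N: bit 0 = ring-0 enable, bit 1 = ring>0
    enable, bit 3 = PMI on overflow (bit 2 is left zero here)."""
    val = 0
    for n, (os_en, usr_en, pmi) in enumerate(controls):
        field = (os_en << 0) | (usr_en << 1) | (pmi << 3)
        val |= field << (4 * n)
    return val
```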
IA32_PERF_GLOBAL_CTRL MSR provides single-bit controls to enable counting of each performance counter.
Figure 18-3 shows the layout of IA32_PERF_GLOBAL_CTRL. Each enable bit in IA32_PERF_GLOBAL_CTRL is
AND’ed with the enable bits for all privilege levels in the respective IA32_PERFEVTSELx or
IA32_PERF_FIXED_CTR_CTRL MSRs to start/stop the counting of the respective counters. Counting is enabled if
the AND’ed result is true; counting is disabled when the result is false.
[Figure 18-3. Layout of IA32_PERF_GLOBAL_CTRL MSR — IA32_PMC0 enable (bit 0), IA32_PMC1 enable (bit 1), IA32_FIXED_CTR0/1/2 enables (bits 32/33/34); remaining bits reserved.]
The behavior of the fixed function performance counters supported by architectural performance version 2 is
expected to be consistent on all processors that support those counters, and is defined as follows.
Table 18-2. Association of Fixed-Function Performance Counters with Architectural Performance Events

Fixed-Function Performance Counter | Address | Event Mask Mnemonic | Description

IA32_FIXED_CTR0 | 309H | INST_RETIRED.ANY
This event counts the number of instructions that retire execution. For instructions that consist of multiple uops, this event counts the retirement of the last uop of the instruction. The counter continues counting during hardware interrupts, traps, and inside interrupt handlers.

IA32_FIXED_CTR1 | 30AH | CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.CORE
The CPU_CLK_UNHALTED.THREAD event counts the number of core cycles while the logical processor is not in a halt state. If there is only one logical processor in a processor core, CPU_CLK_UNHALTED.CORE counts the unhalted cycles of the processor core. The core frequency may change from time to time due to transitions associated with Enhanced Intel SpeedStep Technology or TM2. For this reason, this event may have a changing ratio with regard to time.

IA32_FIXED_CTR2 | 30BH | CPU_CLK_UNHALTED.REF_TSC
This event counts the number of reference cycles at the TSC rate when the core is not in a halt state and not in a TM stop-clock state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (e.g., P-states) but counts at the same frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state and not in a TM stop-clock state.

IA32_FIXED_CTR3 | 30CH | TOPDOWN.SLOTS
This event counts the number of available slots for an unhalted logical processor. The event increments by the machine width of the narrowest pipeline as employed by the Top-down Microarchitecture Analysis method. The count is distributed among unhalted logical processors (hyper-threads) that share the same physical core. Software can use this event as the denominator for the top-level metrics of the Top-down Microarchitecture Analysis method.
IA32_PERF_GLOBAL_STATUS MSR provides single-bit status for software to query the overflow condition of each
performance counter. IA32_PERF_GLOBAL_STATUS[bit 62] indicates overflow conditions of the DS area data
buffer. IA32_PERF_GLOBAL_STATUS[bit 63] provides a CondChgd bit to indicate changes to the state of performance monitoring hardware. Figure 18-4 shows the layout of IA32_PERF_GLOBAL_STATUS. A value of 1 in bits 0,
1, or 32 through 34 indicates that a counter overflow condition has occurred in the associated counter.
When a performance counter is configured for PEBS, an overflow condition in the counter generates a performance-monitoring interrupt signaling a PEBS event. On a PEBS event, the processor stores data records into the
buffer area (see Section 18.15.5), clears the counter overflow status, and sets the “OvfBuffer” bit in
IA32_PERF_GLOBAL_STATUS.
[Figure 18-4. Layout of IA32_PERF_GLOBAL_STATUS MSR — IA32_PMC0/IA32_PMC1 Overflow (bits 0/1), IA32_FIXED_CTR0/1/2 Overflow (bits 32/33/34), OvfDSBuffer (bit 62), CondChgd (bit 63); remaining bits reserved.]
IA32_PERF_GLOBAL_OVF_CTL MSR allows software to clear the overflow indicator(s) of any general-purpose or
fixed-function counters via a single WRMSR. Software should clear overflow indications when:
• Setting up new values in the event select and/or UMASK field for counting or interrupt-based event sampling.
• Reloading counter values to continue collecting the next sample.
• Disabling event counting or interrupt-based event sampling.
The layout of IA32_PERF_GLOBAL_OVF_CTL is shown in Figure 18-5.
[Figure 18-5. Layout of IA32_PERF_GLOBAL_OVF_CTL MSR — IA32_PMC0/IA32_PMC1 ClrOverflow (bits 0/1), IA32_FIXED_CTR0/1/2 ClrOverflow (bits 32/33/34), ClrOvfDSBuffer (bit 62), ClrCondChgd (bit 63); remaining bits reserved.]
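As an illustration of building the value written to IA32_PERF_GLOBAL_OVF_CTL under the bit layout described above (general-purpose counter clear bits in the low bits, fixed-function counter clear bits starting at bit 32), the helper below is a sketch with names of my choosing:

```python
def ovf_ctl_clear(pmcs=(), fixed=(), clr_dsbuffer=False, clr_condchgd=False):
    """Build a value for IA32_PERF_GLOBAL_OVF_CTL clearing the given bits."""
    val = 0
    for i in pmcs:
        val |= 1 << i            # IA32_PMCi ClrOverflow bits
    for i in fixed:
        val |= 1 << (32 + i)     # IA32_FIXED_CTRi ClrOverflow bits start at 32
    val |= clr_dsbuffer << 62    # ClrOvfDSBuffer
    val |= clr_condchgd << 63    # ClrCondChgd
    return val
```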
[Figure 18-6. Layout of IA32_PERFEVTSELx MSRs Supporting Architectural Performance Monitoring Version 3 — as the version-1 layout plus the ANY (AnyThread) bit at bit 21.]
[Figure: Layout of IA32_FIXED_CTR_CTRL MSR supporting architectural performance monitoring version 3 — each 4-bit control field contains EN, ANY, and PMI bits; bits 63:12 reserved.]
Each control block for a fixed-function performance counter provides an AnyThread bit (bit position 2 + 4*N,
N = 0, 1, etc.). When set to 1, it enables counting the associated event conditions (including matching the
thread’s CPL with the ENABLE setting of the corresponding control block of IA32_PERF_FIXED_CTR_CTRL)
occurring across all logical processors sharing a processor core. When an AnyThread bit is 0 in
IA32_PERF_FIXED_CTR_CTRL, the corresponding fixed counter only increments for the associated event
conditions occurring in the logical processor which programmed the IA32_PERF_FIXED_CTR_CTRL MSR.
• The IA32_PERF_GLOBAL_CTRL, IA32_PERF_GLOBAL_STATUS, IA32_PERF_GLOBAL_OVF_CTRL MSRs provide
single-bit controls/status for each general-purpose and fixed-function performance counter. Figure 18-8 and
Figure 18-9 show the layout of these MSRs for N general-purpose performance counters (where N is reported
by CPUID.0AH:EAX[15:8]) and three fixed-function counters.
NOTE
The number of general-purpose performance monitoring counters (i.e., N in Figure 18-9) can vary
across processor generations within a processor family, across processor families, or could be
different depending on the configuration chosen at boot time in the BIOS regarding Intel Hyper-Threading
Technology (e.g., N=2 for 45 nm Intel Atom processors; N=4 for processors based on the Nehalem
microarchitecture; for processors based on the Sandy Bridge microarchitecture, N=4 if Intel Hyper-Threading
Technology is active and N=8 if not). In addition, the number of visible counters may differ from the number of
physical counters present on the hardware, because an agent running at a higher privilege level (e.g., a VMM)
may not expose all counters.
[Figure 18-8: IA32_PERF_GLOBAL_CTRL — IA32_PMC0 through IA32_PMC(N-1) enable bits in the low bits, IA32_FIXED_CTR0/1/2 enable bits at bits 32/33/34; remaining bits reserved.]
Figure 18-9. Global Performance Monitoring Overflow Status and Control MSRs
[Figure 18-9 residue: IA32_PERF_GLOBAL_STATUS carries IA32_PMC0..IA32_PMC(N-1) and IA32_FIXED_CTR0/1/2 Overflow bits plus OvfUncore, OvfDSBuffer, and CondChgd; IA32_PERF_GLOBAL_OVF_CTRL carries the corresponding ClrOverflow, ClrOvfUncore, ClrOvfDSBuffer, and ClrCondChgd bits.]
simple software environments of an earlier era, the evolution of contemporary software environments has
introduced certain concepts and prerequisites with which AnyThread counting does not comply.
One example is the proliferation of software environments that support multiple virtual machines (VMs) under
VMX (see Chapter 23, “Introduction to Virtual-Machine Extensions”), where each VM represents a domain
separated from the others.
A Virtual Machine Monitor (VMM) that manages the VMs may allow individual VMs to employ performance-monitoring facilities to profile the performance characteristics of a workload. The use of the AnyThread interface in
IA32_PERFEVTSELx and IA32_FIXED_CTR_CTRL is discouraged in software environments supporting virtualization or requiring domain separation.
Specifically, Intel recommends that a VMM:
• Configure the MSR bitmap to cause VM exits for WRMSR to IA32_PERFEVTSELx and IA32_FIXED_CTR_CTRL in
VMX non-root operation (see Chapter 24 for additional information).
• Clear the AnyThread bit of IA32_PERFEVTSELx and IA32_FIXED_CTR_CTRL in the MSR-load lists for VM exits
and VM entries (see Chapters 24, 26, and 27).
Even when operating in simpler legacy software environments which might not emphasize the pre-requisites of a
virtualized software environment, the use of the AnyThread interface should be moderated and follow any event-
specific guidance where explicitly noted (see relevant sections of Chapter 19, “Performance Monitoring Events”).
[Figure content: IA32_PERF_GLOBAL_STATUS MSR layout — CondChgd (bit 63), OvfDSBuffer (bit 62), OvfUncore (bit 61), ASCI (bit 60), CTR_Frz (bit 59), LBR_Frz (bit 58), TraceToPAPMI (bit 55), IA32_FIXED_CTR0-2 Overflow (bits 32-34), IA32_PMC0..IA32_PMC(N-1) Overflow (bits 0..N-1); remaining bits reserved.]
[Figure content: overflow-clear MSR layout — Clr CondChgd (bit 63), Clr OvfDSBuffer (bit 62), Clr OvfUncore (bit 61), Clr ASCI (bit 60), Clr CTR_Frz (bit 59), Clr LBR_Frz (bit 58), Clr TraceToPAPMI (bit 55), Clr IA32_FIXED_CTR0-2 Ovf (bits 32-34), Clr IA32_PMC0..IA32_PMC(N-1) Ovf (bits 0..N-1); remaining bits reserved.]
[Figure content: overflow-set MSR layout — Set OvfDSBuffer (bit 62), Set OvfUncore (bit 61), Set ASCI (bit 60), Set CTR_Frz (bit 59), Set LBR_Frz (bit 58), Set TraceToPAPMI (bit 55), Set IA32_FIXED_CTR0-2 Ovf (bits 32-34), Set IA32_PMC0..IA32_PMC(N-1) Ovf (bits 0..N-1); remaining bits reserved.]
[Figure content: IA32_PERF_GLOBAL_INUSE MSR layout — PMI InUse (bit 63), FIXED_CTR0-2 InUse (bits 32-34), PERFEVTSEL0..PERFEVTSEL(N-1) InUse (bits 0..N-1), where N = CPUID.0AH:EAX[15:8]; remaining bits reserved.]
The IA32_PERF_GLOBAL_INUSE MSR provides an “InUse” bit for each programmable performance counter and
fixed counter in the processor. Additionally, it includes an indicator of whether the PMI mechanism has been
configured by a profiling agent.
• IA32_PERF_GLOBAL_INUSE.PERFEVTSEL0_InUse[bit 0]: This bit reflects the logical state of
(IA32_PERFEVTSEL0[7:0] != 0).
• IA32_PERF_GLOBAL_INUSE.PERFEVTSEL1_InUse[bit 1]: This bit reflects the logical state of
(IA32_PERFEVTSEL1[7:0] != 0).
• IA32_PERF_GLOBAL_INUSE.PERFEVTSEL2_InUse[bit 2]: This bit reflects the logical state of
(IA32_PERFEVTSEL2[7:0] != 0).
• IA32_PERF_GLOBAL_INUSE.PERFEVTSELn_InUse[bit n]: This bit reflects the logical state of
(IA32_PERFEVTSELn[7:0] != 0), n < CPUID.0AH:EAX[15:8].
• IA32_PERF_GLOBAL_INUSE.FC0_InUse[bit 32]: This bit reflects the logical state of
(IA32_FIXED_CTR_CTRL[1:0] != 0).
• IA32_PERF_GLOBAL_INUSE.FC1_InUse[bit 33]: This bit reflects the logical state of
(IA32_FIXED_CTR_CTRL[5:4] != 0).
• IA32_PERF_GLOBAL_INUSE.FC2_InUse[bit 34]: This bit reflects the logical state of
(IA32_FIXED_CTR_CTRL[9:8] != 0).
• IA32_PERF_GLOBAL_INUSE.PMI_InUse[bit 63]: This bit is set if any one of the following bits is set:
— IA32_PERFEVTSELn.INT[bit 20], n < CPUID.0AH:EAX[15:8].
— IA32_FIXED_CTR_CTRL.ENi_PMI, i = 0, 1, 2.
— Any IA32_PEBS_ENABLE bit which enables PEBS for a general-purpose or fixed-function performance
counter.
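The bit definitions above can be modeled directly. The following Python sketch derives the InUse bits from hypothetical MSR values; it only manipulates integers (no real MSR access), and the treatment of the PEBS contribution to PMI_InUse is simplified to "any nonzero PEBS enable":

```python
def inuse_bits(perfevtsel: list[int], fixed_ctr_ctrl: int, pebs_enable: int = 0) -> int:
    """Model IA32_PERF_GLOBAL_INUSE from the underlying control MSR values."""
    inuse = 0
    pmi = False
    for n, sel in enumerate(perfevtsel):
        if sel & 0xFF:             # PERFEVTSELn[7:0] != 0 -> counter n in use
            inuse |= 1 << n
        if sel & (1 << 20):        # PERFEVTSELn.INT set -> PMI configured
            pmi = True
    for i in range(3):             # fixed counters 0..2
        field = fixed_ctr_ctrl >> (4 * i)
        if field & 0x3:            # enable field (bits 4i+1:4i) != 0
            inuse |= 1 << (32 + i)
        if field & 0x8:            # ENi_PMI bit (bit 4i+3)
            pmi = True
    if pebs_enable:                # simplification: any PEBS enable implies PMI
        pmi = True
    if pmi:
        inuse |= 1 << 63
    return inuse
```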
If IA32_A_PMCi is present, the 64-bit input value (EDX:EAX) of a WRMSR to IA32_A_PMCi will cause IA32_PMCi to
be updated by:
COUNTERWIDTH = CPUID.0AH:EAX[23:16] bit width of the performance monitoring counter
IA32_PMCi[COUNTERWIDTH-1:32] ← EDX[COUNTERWIDTH-33:0];
IA32_PMCi[31:0] ← EAX[31:0];
Bits 63:COUNTERWIDTH of the input value are reserved.
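The update above is equivalent to masking the 64-bit EDX:EAX input down to COUNTERWIDTH bits. A minimal Python model, assuming a 48-bit counter width for illustration:

```python
def full_width_write(eax: int, edx: int, counter_width: int = 48) -> int:
    """Model the effect on IA32_PMCi of a WRMSR to IA32_A_PMCi.

    IA32_PMCi[31:0] <- EAX[31:0];
    IA32_PMCi[counter_width-1:32] <- EDX[counter_width-33:0];
    which is the same as truncating EDX:EAX to counter_width bits.
    """
    full = (edx << 32) | (eax & 0xFFFFFFFF)
    return full & ((1 << counter_width) - 1)
```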
18.3.1 Performance Monitoring for Processors Based on Intel® Microarchitecture Code Name
Nehalem
Intel Core i7 processor family2 supports architectural performance monitoring capability with version ID 3 (see
Section 18.2.3) and a host of non-architectural monitoring capabilities. The Intel Core i7 processor family is based
on Intel® microarchitecture code name Nehalem, and provides four general-purpose performance counters
(IA32_PMC0, IA32_PMC1, IA32_PMC2, IA32_PMC3) and three fixed-function performance counters
(IA32_FIXED_CTR0, IA32_FIXED_CTR1, IA32_FIXED_CTR2) in the processor core.
Non-architectural performance monitoring in the Intel Core i7 processor family uses the IA32_PERFEVTSELx MSR to
configure a set of non-architectural performance monitoring events to be counted by the corresponding general-
purpose performance counter. The list of non-architectural performance monitoring events is given in Table 19-31.
Non-architectural performance monitoring events fall into two broad categories:
• Performance monitoring events in the processor core: These include many events that are similar to
performance monitoring events available to processors based on Intel Core microarchitecture. Additionally,
there are several enhancements in the performance monitoring capability for detecting microarchitectural
conditions in the processor core or in the interaction of the processor core with the off-core sub-systems in the
physical processor package. The off-core sub-systems in the physical processor package are loosely referred to
as the “uncore“.
• Performance monitoring events in the uncore: The uncore sub-system is shared by more than one processor
core in the physical processor package. It provides additional performance monitoring facilities outside of
IA32_PMCx and performance monitoring events that are specific to the uncore sub-system.
Architectural and non-architectural performance monitoring events in Intel Core i7 processor family support thread
qualification using bit 21 of IA32_PERFEVTSELx MSR.
The bit fields within each IA32_PERFEVTSELx MSR are defined in Figure 18-6 and described in Section 18.2.1.1 and
Section 18.2.3.
2. Intel Xeon processor 5500 series and 3400 series are also based on Intel microarchitecture code name Nehalem; the performance
monitoring facilities described in this section generally also apply.
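The IA32_PERFEVTSELx fields referenced above, including the AnyThread bit (bit 21) used for thread qualification, can be assembled as in this Python sketch. The field positions follow the architectural layout of Section 18.2.1.1; the default flag choices are illustrative only:

```python
def perfevtsel(event: int, umask: int, usr=True, os=True, edge=False,
               pc=False, intr=False, any_thread=False, enable=True,
               inv=False, cmask=0) -> int:
    """Assemble an IA32_PERFEVTSELx value from its bit fields."""
    return ((event & 0xFF)              # Event Select, bits 7:0
            | ((umask & 0xFF) << 8)     # Unit Mask, bits 15:8
            | (usr << 16)               # USR
            | (os << 17)                # OS
            | (edge << 18)              # Edge detect
            | (pc << 19)                # Pin control
            | (intr << 20)              # INT (APIC interrupt enable)
            | (any_thread << 21)        # AnyThread
            | (enable << 22)            # EN
            | (inv << 23)               # INV
            | ((cmask & 0xFF) << 24))   # Counter Mask, bits 31:24
```

For example, `perfevtsel(0x3C, 0x00)` yields 43003CH (UnHalted Core Cycles, counted at all privilege levels, enabled).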
[Figure content: MSR_PERF_GLOBAL_STATUS layout — CHG (R/W), OVF_PMI (R/W), OVF_FC0-2 (R/O, bits 32-34), OVF_PC0-7 (R/O, bits 0-7; OVF_PC4-7 valid only if CCNT > 4..7), where CCNT = CPUID.0AH:EAX[15:8]; remaining bits reserved; RESET value 00000000_00000000H.]
NOTE
The number of counters available to software may vary from the number of physical counters
present on the hardware, because an agent running at a higher privilege level (e.g., a VMM) may
not expose all counters. CPUID.0AH:EAX[15:8] reports the number of counters available to
software; see Section 18.2.1.
Additionally, the PEBS record is expanded to allow latency information to be captured. The MSR
IA32_PEBS_ENABLE provides 4 additional bits that software must use to enable latency data recording in the PEBS
record upon the respective IA32_PMCx overflow condition. The layout of IA32_PEBS_ENABLE for processors based
on Intel microarchitecture code name Nehalem is shown in Figure 18-15.
When a counter is enabled to capture machine state (PEBS_EN_PMCx = 1), the processor will write machine state
information to a memory buffer specified by software as detailed below. When the counter IA32_PMCx overflows
from maximum count to zero, the PEBS hardware is armed.
[Figure 18-15 content: layout of IA32_PEBS_ENABLE — PEBS_EN_PMC0-3 (R/W, bits 0-3), LL_EN_PMC0-3 (R/W, bits 32-35); remaining bits reserved; RESET value 00000000_00000000H.]
Upon occurrence of the next PEBS event, the PEBS hardware triggers an assist and causes a PEBS record to be
written. The format of the PEBS record is indicated by the bit field IA32_PERF_CAPABILITIES[11:8] (see
Figure 18-63).
The behavior of PEBS assists is reported by IA32_PERF_CAPABILITIES[6] (see Figure 18-63). The return instruc-
tion pointer (RIP) reported in the PEBS record will point to the instruction after (+1) the instruction that causes the
PEBS assist. The machine state reported in the PEBS record is the machine state after the instruction that causes
the PEBS assist is retired. For instance, if the instructions:
mov eax, [eax] ; causes PEBS assist
nop
are executed, the PEBS record will report the address of the nop, and the value of EAX in the PEBS record will show
the value read from memory, not the target address of the read operation.
The PEBS record format is shown in Table 18-3, and each field in the PEBS record is 64 bits long. The PEBS record
format, along with the debug/store area storage format, does not change regardless of whether IA-32e mode is
active. CPUID.01H:ECX.DTES64[bit 2] reports whether the processor's DS storage format support is mode-
independent. When set, the processor uses the 64-bit DS storage format.
Table 18-3. PEBS Record Format for Intel Core i7 Processor Family
Byte Offset   Field                     Byte Offset   Field
00H           R/EFLAGS                  58H           R9
08H           R/EIP                     60H           R10
10H           R/EAX                     68H           R11
18H           R/EBX                     70H           R12
20H           R/ECX                     78H           R13
28H           R/EDX                     80H           R14
30H           R/ESI                     88H           R15
38H           R/EDI                     90H           IA32_PERF_GLOBAL_STATUS
40H           R/EBP                     98H           Data Linear Address
48H           R/ESP                     A0H           Data Source Encoding
50H           R8                        A8H           Latency value (core cycles)
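Given the fixed offsets in Table 18-3, a raw PEBS record can be decoded as 22 consecutive little-endian 64-bit fields. A Python sketch (the field names are simplified labels chosen for this example, not official mnemonics):

```python
import struct

# Field order matches the byte offsets of Table 18-3 (00H through A8H).
PEBS_FIELDS = [
    "RFLAGS", "RIP", "RAX", "RBX", "RCX", "RDX", "RSI", "RDI",
    "RBP", "RSP", "R8", "R9", "R10", "R11", "R12", "R13", "R14", "R15",
    "PERF_GLOBAL_STATUS", "DATA_LINEAR_ADDRESS", "DATA_SOURCE", "LATENCY",
]

def decode_pebs_record(raw: bytes) -> dict:
    """Unpack one 176-byte (B0H) PEBS record into a field dictionary."""
    assert len(raw) == 8 * len(PEBS_FIELDS)
    return dict(zip(PEBS_FIELDS, struct.unpack("<22Q", raw)))
```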
In IA-32e mode, the full 64-bit value is written to the register. If the processor is not operating in IA-32e mode, a
32-bit value is written to registers with bits 63:32 zeroed. Registers not defined when the processor is not in IA-32e
mode are written as zero.
Bytes AFH:90H are an enhancement to the PEBS record format. Support for this enhanced PEBS record format is
indicated by an IA32_PERF_CAPABILITIES[11:8] encoding of 0001B.
The value written to bytes 97H:90H is the state of the IA32_PERF_GLOBAL_STATUS register before the PEBS assist
occurred. This value is written so software can determine which counters overflowed when this PEBS record was
written. Note that this field indicates the overflow status for all counters, regardless of whether they were
programmed for PEBS or not.
Programming PEBS Facility
Only a subset of non-architectural performance events in the processor support PEBS. The subset of precise events
is listed in Table 18-78. In addition to using IA32_PERFEVTSELx to specify event unit/mask settings and setting
the EN_PMCx bit in the IA32_PEBS_ENABLE register for the respective counter, software must also initialize the
DS_BUFFER_MANAGEMENT_AREA data structure in memory to support capturing PEBS records for precise events.
NOTE
PEBS events are only valid when the following fields of IA32_PERFEVTSELx are all zero: AnyThread,
Edge, Invert, CMask.
The beginning linear address of the DS_BUFFER_MANAGEMENT_AREA data structure must be programmed into
the IA32_DS_AREA register. The layout of the DS_BUFFER_MANAGEMENT_AREA is shown in Figure 18-16.
• PEBS Buffer Base: This field is programmed with the linear address of the first byte of the PEBS buffer
allocated by software. The processor reads this field to determine the base address of the PEBS buffer.
Software should allocate this memory from the non-paged pool.
• PEBS Index: This field is initially programmed with the same value as the PEBS Buffer Base field, or the
beginning linear address of the PEBS buffer. The processor reads this field to determine the location of the next
PEBS record to write to. After a PEBS record has been written, the processor also updates this field with the
address of the next PEBS record to be written. The figure above illustrates the state of PEBS Index after the first
PEBS record is written.
• PEBS Absolute Maximum: This field holds the starting address of the PEBS buffer plus the maximum length of
the allocated buffer, i.e., the absolute address just past the end of the buffer. The processor will not write any
PEBS record beyond the end of the PEBS buffer; when PEBS Index equals PEBS Absolute Maximum, no further
records are written. No signal is generated when the PEBS buffer is full. Software must reset the PEBS Index
field to the beginning of the PEBS buffer address to continue capturing PEBS records.
[Figure 18-16 content: DS Buffer Management Area and PEBS buffer layout — the IA32_DS_AREA MSR points to the DS Buffer Management Area, which in turn locates the PEBS records (PEBS Record 0 .. PEBS Record n) in the PEBS buffer.]
• PEBS Interrupt Threshold: This field specifies the threshold value to trigger a performance interrupt and
notify software that the PEBS buffer is nearly full. This field is programmed with the linear address of the first
byte of the PEBS record within the PEBS buffer that represents the threshold record. After the processor writes
a PEBS record and updates PEBS Index, if the PEBS Index reaches the threshold value of this field, the
processor will generate a performance interrupt. This is the same interrupt that is generated by a performance
counter overflow, as programmed in the Performance Monitoring Counters vector in the Local Vector Table of
the Local APIC. When a performance interrupt due to PEBS buffer full is generated, the
IA32_PERF_GLOBAL_STATUS.PEBS_Ovf bit will be set.
• PEBS CounterX Reset: This field allows software to set up the PEBS counter overflow condition to occur at a rate
useful for profiling the workload, thereby generating multiple PEBS records to facilitate characterizing the
execution profile of the test code. After each PEBS record is written, the processor checks each counter to see if it
overflowed and was enabled for PEBS (the corresponding bit in IA32_PEBS_ENABLE was set). If these
conditions are met, then the reset value for each overflowed counter is loaded from the DS Buffer Management
Area. For example, if counter IA32_PMC0 caused a PEBS record to be written, then the value of “PEBS Counter
0 Reset” would be written to counter IA32_PMC0. If a counter is not enabled for PEBS, its value will not be
modified by the PEBS assist.
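The PEBS Index update and threshold-interrupt rule described above can be modeled in a few lines. This Python sketch is a simplification: it tracks only the index arithmetic and the threshold comparison, and notes the Absolute Maximum behavior in a comment rather than enforcing it:

```python
def advance_pebs_index(index: int, record_size: int, absolute_max: int,
                       threshold: int) -> tuple[int, bool]:
    """Model the processor's PEBS Index update after one record is written.

    Returns (new PEBS Index, whether the threshold interrupt fires).
    Once PEBS Index reaches absolute_max, no further records are written;
    software must reset the index to the buffer base to continue.
    """
    new_index = index + record_size
    fires_pmi = new_index >= threshold   # interrupt when index reaches threshold
    return new_index, fires_pmi
```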
Performance Counter Prioritization
Performance monitoring interrupts are triggered by a counter transitioning from maximum count to zero (assuming
IA32_PerfEvtSelX.INT is set). This same transition will cause PEBS hardware to arm, but not trigger. PEBS hard-
ware triggers upon detection of the first PEBS event after the PEBS hardware has been armed (a 0 to 1 transition
of the counter). At this point, a PEBS assist will be undertaken by the processor.
Performance counters (fixed and general-purpose) are prioritized in index order. That is, counter IA32_PMC0 takes
precedence over all other counters. Counter IA32_PMC1 takes precedence over counters IA32_PMC2 and
IA32_PMC3, and so on. This means that if simultaneous overflows or PEBS assists occur, the appropriate action will
be taken for the highest priority performance counter. For example, if IA32_PMC1 causes an overflow interrupt and
IA32_PMC2 causes a PEBS assist simultaneously, then the overflow interrupt will be serviced first.
The PEBS threshold interrupt is triggered by the PEBS assist, and is by definition prioritized lower than the PEBS
assist. Hardware will not generate separate interrupts for each counter that simultaneously overflows. General-
purpose performance counters are prioritized over fixed counters.
If a counter is programmed with a precise (PEBS-enabled) event and programmed to generate a counter overflow
interrupt, the PEBS assist is serviced before the counter overflow interrupt is serviced. If, in addition, the PEBS
interrupt threshold is met, the threshold interrupt is generated after the PEBS assist completes, followed by the
counter overflow interrupt (two separate interrupts are generated).
Uncore counters may be programmed to interrupt one or more processor cores (see Section 18.3.1.2). It is
possible for interrupts posted from the uncore facility to occur coincident with counter overflow interrupts from the
processor core. Software must check core and uncore status registers to determine the exact origin of counter
overflow interrupts.
• Data Source: The encoded value indicates the origin of the data obtained by the load instruction. The encoding
is shown in Table 18-4. In the descriptions, local memory refers to system memory physically attached to a
processor package, and remote memory refers to system memory physically attached to another processor
package.
Bits 15:0 specify the threshold load latency in core clock cycles. Performance events with latencies greater than
this value are counted in IA32_PMCx and their latency information is reported in the PEBS record. Otherwise, they
are ignored. The minimum value that may be programmed in this field is 3.
The layout of MSR_OFFCORE_RSP_0 is shown in Figure 18-18. Bits 7:0 specify the request type of a transaction
request to the uncore. Bits 15:8 specify the response of the uncore subsystem.
Figure 18-18. Layout of MSR_OFFCORE_RSP_0 and MSR_OFFCORE_RSP_1 to Configure Off-core Response Events
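A small helper illustrating how request-type selections (bits 7:0) and response selections (bits 15:8) combine into an MSR_OFFCORE_RSP_x value. The specific bit positions passed in the example are placeholders; the per-bit meanings are defined in the associated event tables:

```python
def offcore_rsp(request_bits: list[int], response_bits: list[int]) -> int:
    """Combine request-type and response bit selections into one MSR value."""
    val = 0
    for b in request_bits:
        assert 0 <= b <= 7, "request-type bits occupy positions 7:0"
        val |= 1 << b
    for b in response_bits:
        assert 8 <= b <= 15, "response bits occupy positions 15:8"
        val |= 1 << b
    return val
```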
• PMI_FRZ (bit 63): When set, all U-clock uncore counters are disabled when any one of them signals a
performance interrupt. Software must explicitly re-enable the counter by setting the enable bits in
MSR_UNCORE_PERF_GLOBAL_CTRL upon exit from the ISR.
[Figure content: MSR_UNCORE_PERF_GLOBAL_CTRL layout — EN_PC0-7 (R/W, bits 0-7), EN_FC0 (R/W, bit 32), EN_PMI_CORE0-3 (R/W, bits 48-51), PMI_FRZ (R/W, bit 63); remaining bits reserved.]
[Figure content: MSR_UNCORE_PERF_GLOBAL_STATUS layout — OVF_PC0-7 (R/O, bits 0-7), OVF_FC0 (R/O, bit 32), OVF_PMI (R/W, bit 61), CHG (R/W, bit 63); remaining bits reserved.]
[Figure content: MSR_UNCORE_PERF_GLOBAL_OVF_CTRL layout — CLR_OVF_PC0-7 (WO1, bits 0-7), CLR_OVF_FC0 (WO1, bit 32), CLR_OVF_PMI (WO1, bit 61), CLR_CHG (WO1, bit 63); remaining bits reserved.]
• CLR_OVF_PCn (bit n, n = 0-7): Set this bit to clear the overflow status for general-purpose uncore counter
MSR_UNCORE_PerfCntr n. Writing a value other than 1 is ignored.
• CLR_OVF_FC0 (bit 32): Set this bit to clear the overflow status for the fixed-function uncore counter
MSR_UNCORE_FixedCntr0. Writing a value other than 1 is ignored.
• CLR_OVF_PMI (bit 61): Set this bit to clear the OVF_PMI flag in MSR_UNCORE_PERF_GLOBAL_STATUS. Writing
a value other than 1 is ignored.
• CLR_CHG (bit 63): Set this bit to clear the CHG flag in MSR_UNCORE_PERF_GLOBAL_STATUS register. Writing
a value other than 1 is ignored.
[Figure content: uncore performance event select MSR layout — Event Select (bits 7:0), Unit Mask (UMASK, bits 15:8), OCC_CTR_RST (bit 17), Edge Detect (bit 18), PMI (bit 20), EN (bit 22), INV (bit 23), Counter Mask (CMASK, bits 31:24); remaining bits reserved.]
• Event Select (bits 7:0): Selects the event logic unit used to detect uncore events.
• Unit Mask (bits 15:8): Condition qualifiers for the event selection logic specified in the Event Select field.
• OCC_CTR_RST (bit 17): When set, causes the queue occupancy counter associated with this event to be cleared
(zeroed). Writing a zero to this bit will be ignored. It will always read as a zero.
• Edge Detect (bit 18): When set, causes the counter to increment when a deasserted-to-asserted transition
occurs for the conditions that can be expressed by any of the fields in this register.
• PMI (bit 20): When set, the uncore will generate an interrupt request when this counter overflows. This
request will be routed to the logical processors as enabled in the PMI enable bits (EN_PMI_COREx) in the
register MSR_UNCORE_PERF_GLOBAL_CTRL.
• EN (bit 22): When clear, this counter is locally disabled. When set, this counter is locally enabled and counting
starts when the corresponding EN_PCx bit in MSR_UNCORE_PERF_GLOBAL_CTRL is set.
• INV (bit 23): When clear, the Counter Mask field is interpreted as greater than or equal to. When set, the
Counter Mask field is interpreted as less than.
• Counter Mask (bits 31:24): When this field is clear, it has no effect on counting. When set to a value other than
zero, the logical processor compares this field to the event counts on each core clock cycle. If INV is clear and
the event counts are greater than or equal to this field, the counter is incremented by one. If INV is set and the
event counts are less than this field, the counter is incremented by one. Otherwise the counter is not incre-
mented.
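The Counter Mask/INV comparison described above amounts to a simple per-cycle threshold test, sketched here in Python:

```python
def increments(event_count: int, cmask: int, inv: bool) -> int:
    """Per-cycle increment for a counter with Counter Mask (CMASK) and INV.

    With CMASK clear the counter accumulates the raw event count; with CMASK
    set it increments by at most 1 per cycle, depending on the comparison.
    """
    if cmask == 0:
        return event_count              # CMASK clear: no effect on counting
    if not inv:
        return 1 if event_count >= cmask else 0   # INV clear: >= comparison
    return 1 if event_count < cmask else 0        # INV set: < comparison
```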
Figure 18-23 shows the layout of MSR_UNCORE_FIXED_CTR_CTRL.
• EN (bit 0): When clear, the uncore fixed-function counter is locally disabled. When set, it is locally enabled and
counting starts when the EN_FC0 bit in MSR_UNCORE_PERF_GLOBAL_CTRL is set.
• PMI (bit 2): When set, the uncore will generate an interrupt request when the uncore fixed-function counter
overflows. This request will be routed to the logical processors as enabled in the PMI enable bits
(EN_PMI_COREx) in the register MSR_UNCORE_PERF_GLOBAL_CTRL.
Both the general-purpose counters (MSR_UNCORE_PerfCntr) and the fixed-function counter
(MSR_UNCORE_FixedCntr0) are 48 bits wide. They support both counting and interrupt based sampling usages.
The event logic unit can filter event counts to specific regions of code or transaction types incoming to the home
node logic.
[Figure content: address/opcode match MSR layout — ADDR (physical address field, bits 39:3), Opcode (opcode and message class, bits 47:40), MatchSel (selects address/opcode matching, bits 63:61); remaining bits reserved; RESET value 00000000_00000000H.]
• Addr (bits 39:3): The physical address to match if the “MatchSel“ field is set to select address matching. The
uncore performance counter will increment if bits 39:3 of the incoming 40-bit physical address (excluding bits
2:0) of a transaction request match this field.
• Opcode (bits 47:40): Bits 47:40 allow software to filter uncore transactions based on the QPI link message
class/packet header opcode. These bits consist of two sub-fields:
— Bits 43:40 specify the QPI packet header opcode.
— Bits 47:44 specify the QPI message classes.
Table 18-7 lists the encodings supported in the opcode field.
• MatchSel (bits 63:61): Software specifies the match criteria according to the following encoding:
— 000B: Disable addr_opcode match hardware.
— 100B: Count if only the address field matches.
— 010B: Count if only the opcode field matches.
— 110B: Count if either opcode field matches or the address field matches.
— 001B: Count only if both opcode and address field match.
— Other encodings are reserved.
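The MatchSel encodings can be captured as a small truth function; this Python sketch mirrors the encoding list above:

```python
def addr_opcode_match(matchsel: int, addr_match: bool, opcode_match: bool) -> bool:
    """Evaluate the MatchSel encodings (bits 63:61 of the match MSR)."""
    if matchsel == 0b000:
        return False                         # match hardware disabled
    if matchsel == 0b100:
        return addr_match                    # address field only
    if matchsel == 0b010:
        return opcode_match                  # opcode field only
    if matchsel == 0b110:
        return addr_match or opcode_match    # either matches
    if matchsel == 0b001:
        return addr_match and opcode_match   # both must match
    raise ValueError("reserved MatchSel encoding")
```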
[Figure content: distributed uncore units, including the L3 cache, SBoxes, and SMI channels.]
Figure 18-25. Distributed Units of the Uncore of Intel® Xeon® Processor 7500 Series
Table 18-8 summarizes the number of MSRs for the uncore PMU of each box.
The W-Box provides 4 general-purpose counters, each requiring an event select configuration MSR, similar to the
general-purpose counters in other boxes. There is also a fixed-function counter that increments clockticks in the
uncore clock domain.
For the C, S, B, M, R, and W boxes, each box provides an MSR to enable/disable counting and to configure PMIs for
multiple counters within the same box; this is somewhat similar to the “global control“ programming interface,
IA32_PERF_GLOBAL_CTRL, offered in the core PMU. Similarly, status information and counter overflow control for
multiple counters within the same box are also provided in the C, S, B, M, R, and W boxes.
In the U-Box, MSR_U_PMON_GLOBAL_CTL provides overall uncore PMU enable/disable and PMI configuration
control. The scope of status information in the U-Box is at per-box granularity, in contrast to the per-box status
information MSRs (in the C, S, B, M, R, and W boxes), which provide status information for individual counter
overflows. The difference in scope also applies to the overflow control MSR in the U-Box versus those in the other
boxes.
The individual MSRs that provide uncore PMU interfaces are listed in Chapter 2, “Model-Specific Registers (MSRs)”
in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4, Table 2-17, under the general
naming style MSR_%box#%_PMON_%scope_function%, where %box#% designates the type of box and a zero-
based index if there is more than one box of the same type, and %scope_function% follows the examples below:
• Multi-counter enabling MSRs: MSR_U_PMON_GLOBAL_CTL, MSR_S0_PMON_BOX_CTL,
MSR_C7_PMON_BOX_CTL, etc.
• Multi-counter status MSRs: MSR_U_PMON_GLOBAL_STATUS, MSR_S0_PMON_BOX_STATUS,
MSR_C7_PMON_BOX_STATUS, etc.
• Multi-counter overflow control MSRs: MSR_U_PMON_GLOBAL_OVF_CTL, MSR_S0_PMON_BOX_OVF_CTL,
MSR_C7_PMON_BOX_OVF_CTL, etc.
• Performance counters MSRs: the scope is implicitly per counter, e.g. MSR_U_PMON_CTR,
MSR_S0_PMON_CTR0, MSR_C7_PMON_CTR5, etc.
• Event select MSRs: the scope is implicitly per counter, e.g. MSR_U_PMON_EVNT_SEL,
MSR_S0_PMON_EVNT_SEL0, MSR_C7_PMON_EVNT_SEL5, etc.
• Sub-control MSRs: the scope is implicitly per-box granularity, e.g. MSR_M0_PMON_TIMESTAMP,
MSR_R0_PMON_IPERF0_P1, MSR_S1_PMON_MATCH.
Details of uncore PMU MSR bit field definitions can be found in a separate document “Intel Xeon Processor 7500
Series Uncore Performance Monitoring Guide“.
18.3.2 Performance Monitoring for Processors Based on Intel® Microarchitecture Code Name
Westmere
All of the performance monitoring programming interfaces (architectural and non-architectural core PMU facilities,
and uncore PMU) described in Section 18.6.3 also apply to processors based on Intel® microarchitecture code
name Westmere.
Table 18-5 describes a non-architectural performance monitoring event (event code 0B7H) and associated
MSR_OFFCORE_RSP_0 (address 1A6H) in the core PMU. This event and a second functionally equivalent offcore
response event using event code 0BBH and MSR_OFFCORE_RSP_1 (address 1A7H) are supported in processors
based on Intel microarchitecture code name Westmere. The event code and event mask definitions of non-archi-
tectural performance monitoring events are listed in Table 19-31.
The load latency facility is the same as described in Section 18.3.1.1.2, but with an added enhancement that
provides more information in the data source encoding field of each load latency record. The additional information
relates to STLB_MISS and LOCK; see Table 18-13.
3. Exceptions are indicated for event code 0FH in Table 19-23; and valid bits of data source encoding field of each load
latency record is limited to bits 5:4 of Table 18-13.
Table 18-9 summarizes the number of MSRs for the uncore PMU of each box.
Table 18-9. Uncore PMU MSR Summary for Intel® Xeon® Processor E7 Family
Box     # of Boxes   Counters per Box          Counter Width   General Purpose   Global Enable   Sub-control MSRs
C-Box   10           6                         48              Yes               per-box         None
S-Box   2            4                         48              Yes               per-box         Match/Mask
B-Box   2            4                         48              Yes               per-box         Match/Mask
M-Box   2            6                         48              Yes               per-box         Yes
R-Box   1            16 (2 ports, 8 per port)  48              Yes               per-box         Yes
W-Box   1            4                         48              Yes               per-box         None
                     1                         48              No                per-box         None
U-Box   1            1                         48              Yes               uncore          None
Details of the uncore performance monitoring facility of Intel Xeon Processor E7 family is available in the “Intel®
Xeon® Processor E7 Uncore Performance Monitoring Programming Reference Manual”.
18.3.4 Performance Monitoring for Processors Based on Intel® Microarchitecture Code Name
Sandy Bridge
Intel® Core™ i7-2xxx, Intel® Core™ i5-2xxx, Intel® Core™ i3-2xxx processor series, and Intel® Xeon® processor
E3-1200 family are based on Intel microarchitecture code name Sandy Bridge; this section describes the perfor-
mance monitoring facilities provided in the processor core. The core PMU supports architectural performance moni-
toring capability with version ID 3 (see Section 18.2.3) and a host of non-architectural monitoring capabilities.
Architectural performance monitoring version 3 capabilities are described in Section 18.2.3.
The core PMU’s capability is similar to those described in Section 18.3.1.1 and Section 18.6.3, with some differ-
ences and enhancements relative to Intel microarchitecture code name Westmere summarized in Table 18-10.
18.3.4.1 Global Counter Control Facilities In Intel® Microarchitecture Code Name Sandy Bridge
The number of general-purpose performance counters visible to a logical processor can vary across processors
based on Intel microarchitecture code name Sandy Bridge. Software must use CPUID to determine the number of
performance counters/event select registers (see Section 18.2.1.1).
[Figure content: IA32_PERF_GLOBAL_CTRL layout — PMC0_EN-PMC3_EN (bits 0-3), PMC4_EN-PMC7_EN (bits 4-7, each valid only if the corresponding PMC is present), FIXED_CTR0-2 enable (bits 32-34); remaining bits reserved.]
Figure 18-26. IA32_PERF_GLOBAL_CTRL MSR in Intel® Microarchitecture Code Name Sandy Bridge
Figure 18-26 depicts the layout of the IA32_PERF_GLOBAL_CTRL MSR. The enable bits (PMC4_EN, PMC5_EN,
PMC6_EN, PMC7_EN) corresponding to IA32_PMC4-IA32_PMC7 are valid only if CPUID.0AH:EAX[15:8] reports a
value of 8. If CPUID.0AH:EAX[15:8] = 4, attempts to set the invalid bits will cause #GP.
Each enable bit in IA32_PERF_GLOBAL_CTRL is AND’ed with the enable bits for all privilege levels in the respective
IA32_PERFEVTSELx or IA32_PERF_FIXED_CTR_CTRL MSRs to start/stop the counting of the respective counters.
Counting is enabled if the AND’ed result is true; counting is disabled when the result is false.
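The AND of the global and per-counter enables can be sketched as follows. This is a simplification: it checks only the EN bit (bit 22) of IA32_PERFEVTSELx and ignores the per-privilege-level USR/OS qualification mentioned above:

```python
def counting_enabled(global_ctrl: int, perfevtsel: int, pmc_index: int) -> bool:
    """A general-purpose counter counts only when both its global enable bit
    in IA32_PERF_GLOBAL_CTRL and the EN bit in its IA32_PERFEVTSELx are set."""
    global_en = bool(global_ctrl & (1 << pmc_index))
    local_en = bool(perfevtsel & (1 << 22))   # EN bit
    return global_en and local_en
```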
IA32_PERF_GLOBAL_STATUS MSR provides single-bit status used by software to query the overflow condition of
each performance counter. IA32_PERF_GLOBAL_STATUS[bit 62] indicates overflow conditions of the DS area data
buffer (see Figure 18-27). A value of 1 in each bit of the PMCx_OVF field indicates an overflow condition has
occurred in the associated counter.
[Figure content: IA32_PERF_GLOBAL_STATUS layout — PMC0_OVF-PMC3_OVF (RO, bits 0-3), PMC4_OVF-PMC7_OVF (RO, bits 4-7, each valid only if the corresponding PMC is present), FIXED_CTR0-2 Overflow (RO, bits 32-34), Ovf_UncorePMU (bit 61), Ovf_DSBuffer (bit 62), CondChgd (bit 63); remaining bits reserved.]
Figure 18-27. IA32_PERF_GLOBAL_STATUS MSR in Intel® Microarchitecture Code Name Sandy Bridge
When a performance counter is configured for PEBS, an overflow condition in the counter will arm PEBS. On the
subsequent event following overflow, the processor will generate a PEBS event. On a PEBS event, the processor will
perform bounds checks based on the parameters defined in the DS Save Area (see Section 17.4.9). Upon
successful bounds checks, the processor will store the data record in the defined buffer area, clear the counter
overflow status, and reload the counter. If the bounds checks fail, the PEBS record will be skipped entirely. In the
event that the PEBS buffer fills up, the processor will set the OvfBuffer bit in MSR_PERF_GLOBAL_STATUS.
The IA32_PERF_GLOBAL_OVF_CTRL MSR allows software to clear the overflow indicators for general-purpose or
fixed-function counters via a single WRMSR (see Figure 18-28). Clear overflow indications when:
• Setting up new values in the event select and/or UMASK field for counting or interrupt based sampling.
• Reloading counter values to continue sampling.
• Disabling event counting or interrupt based sampling.
[Figure content: IA32_PERF_GLOBAL_OVF_CTRL layout — PMC0_ClrOvf-PMC3_ClrOvf (bits 0-3), PMC4_ClrOvf-PMC7_ClrOvf (bits 4-7, valid if CPUID.0AH:EAX[15:8] = 8, else reserved), FIXED_CTR0-2 ClrOverflow (bits 32-34), ClrOvfUncore (bit 61), ClrOvfDSBuffer (bit 62), ClrCondChgd (bit 63); remaining bits reserved.]
Figure 18-28. IA32_PERF_GLOBAL_OVF_CTRL MSR in Intel microarchitecture code name Sandy Bridge
NOTE
PEBS events are only valid when the following fields of IA32_PERFEVTSELx are all zero: AnyThread,
Edge, Invert, CMask.
In the IA32_PEBS_ENABLE MSR, bit 63 is defined as PS_ENABLE: when set, it enables IA32_PMC3 to capture precise store information. Only IA32_PMC3 supports the precise store facility. In typical usage of PEBS, the bit fields in IA32_PEBS_ENABLE are written when the agent software starts PEBS operation; the enabled bit fields should be modified only when re-programming another PEBS event, or cleared when the agent uses the performance counters for non-PEBS operations.
The IA32_PEBS_ENABLE bit fields (RESET value — 00000000_00000000H) are:
• Bits 3:0: PEBS_EN_PMC0 through PEBS_EN_PMC3 (R/W).
• Bits 35:32: LL_EN_PMC0 through LL_EN_PMC3 (R/W).
• Bit 63: PS_EN (R/W).
All other bits are reserved.
Table 18-12. PEBS Performance Events for Intel® Microarchitecture Code Name Sandy Bridge

Event Name                      Event Select   Sub-event          UMask
INST_RETIRED                    C0H            PREC_DIST          01H (see note 1)
UOPS_RETIRED                    C2H            All                01H
                                               Retire_Slots       02H
BR_INST_RETIRED                 C4H            Conditional        01H
                                               Near_Call          02H
                                               All_branches       04H
                                               Near_Return        08H
                                               Near_Taken         20H
BR_MISP_RETIRED                 C5H            Conditional        01H
                                               Near_Call          02H
                                               All_branches       04H
                                               Not_Taken          10H
                                               Taken              20H
MEM_UOPS_RETIRED                D0H            STLB_MISS_LOADS    11H
                                               STLB_MISS_STORE    12H
                                               LOCK_LOADS         21H
                                               SPLIT_LOADS        41H
                                               SPLIT_STORES       42H
                                               ALL_LOADS          81H
                                               ALL_STORES         82H
MEM_LOAD_UOPS_RETIRED           D1H            L1_Hit             01H
                                               L2_Hit             02H
                                               L3_Hit             04H
                                               Hit_LFB            40H
MEM_LOAD_UOPS_LLC_HIT_RETIRED   D2H            XSNP_Miss          01H
                                               XSNP_Hit           02H
                                               XSNP_Hitm          04H
                                               XSNP_None          08H
NOTES:
1. Only available on IA32_PMC1.
programmed. The CMASK or INV fields of the IA32_PerfEvtSelX register used for counting load latency must be
0. Writing other values will result in undefined behavior.
• The MSR_PEBS_LD_LAT_THRESHOLD MSR is programmed with the desired latency threshold in core clock
cycles. Loads with latencies greater than this value are eligible for counting and latency data reporting. The
minimum value that may be programmed in this register is 3 (the minimum detectable load latency is 4 core
clock cycles).
• The PEBS enable bit in the IA32_PEBS_ENABLE register is set for the corresponding IA32_PMCx counter
register. This means that both the PEBS_EN_CTRX and LL_EN_CTRX bits must be set for the counter(s) of
interest. For example, to enable load latency on counter IA32_PMC0, the IA32_PEBS_ENABLE register must be
programmed with the 64-bit value 00000001.00000001H.
• When the load latency event is enabled, no other PEBS event can be configured on other counters.
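The enabling step above can be sketched as a value computation. The sketch builds the IA32_PEBS_ENABLE value for load latency on IA32_PMC0, where both the PEBS enable bit (bit 0) and the load-latency enable bit (bit 32) must be set; the macro names are illustrative.

```c
#include <stdint.h>

#define PEBS_EN_PMC(n)  (1ULL << (n))        /* PEBS_EN_PMCx, bits 3:0  */
#define LL_EN_PMC(n)    (1ULL << (32 + (n))) /* LL_EN_PMCx,  bits 35:32 */

/* Value to program into IA32_PEBS_ENABLE to enable the load latency
 * facility on IA32_PMC0: both PEBS_EN_PMC0 and LL_EN_PMC0 must be set. */
uint64_t pebs_enable_load_latency_pmc0(void)
{
    return PEBS_EN_PMC(0) | LL_EN_PMC(0); /* = 00000001.00000001H */
}
```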
When the load-latency facility is enabled, load operations are randomly selected by hardware and tagged to carry
information related to data source locality and latency. Latency and data source information of tagged loads are
updated internally. The MEM_TRANS_RETIRED event for load latency counts only tagged retired loads. If a load is
cancelled it will not be counted and the internal state of the load latency facility will not be updated. In this case the
hardware will tag the next available load.
When a PEBS assist occurs, the last update of latency and data source information is captured by the assist and written as part of the PEBS record. The PEBS sample after value (SAV), specified in PEBS CounterX Reset, operates
orthogonally to the tagging mechanism. Loads are randomly tagged to collect latency data. The SAV controls the
number of tagged loads with latency information that will be written into the PEBS record field by the PEBS assists.
The load latency data written to the PEBS record will be for the last tagged load operation which retired just before
the PEBS assist was invoked.
The physical layout of the PEBS records is the same as shown in Table 18-3. The Data Source entry at offset A0H has been enhanced to report three pieces of information.
The layouts of MSR_OFFCORE_RSP_0 and MSR_OFFCORE_RSP_1 are shown in Figure 18-30 and Figure 18-31. Bits 15:0 specify the request type of a transaction request to the uncore, bits 30:16 specify supplier information, and bits 37:31 specify snoop response information.
To properly program this extra register, software must set at least one request type bit and a valid response type pattern. Otherwise, the event count reported will be zero. It is permissible and useful to set multiple request and response type bits in order to obtain various classes of off-core response events. Although MSR_OFFCORE_RSP_x allows software to program numerous combinations that meet the above guideline, not all combinations produce meaningful data.
To specify a complete offcore response filter, software must properly program bits in the request and response type
fields. A valid request type must have at least one bit set in the non-reserved bits of 15:0. A valid response type
must be a non-zero value of the following expression:
ANY | [(‘OR’ of Supplier Info Bits) & (‘OR’ of Snoop Info Bits)]
If “ANY“ bit is set, the supplier and snoop info bits are ignored.
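The validity rule above can be expressed as a check over the MSR value. The sketch below assumes the field boundaries given in the text (request type in bits 15:0, supplier info in bits 30:16, snoop info in bits 37:31) and that ANY is bit 16 per Table 18-28; the mask names are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

/* Field masks for MSR_OFFCORE_RSP_x (Sandy Bridge layout, per the text). */
#define RSP_REQ_MASK    0x000000000000FFFFULL /* request type, bits 15:0          */
#define RSP_ANY_BIT     (1ULL << 16)          /* ANY catch-all response bit       */
#define RSP_SUPP_MASK   0x000000007FFE0000ULL /* supplier info, bits 30:17        */
#define RSP_SNOOP_MASK  0x0000003F80000000ULL /* snoop info, bits 37:31           */

/* A filter is complete when at least one request-type bit is set and the
 * response type satisfies: ANY | [(OR of supplier bits) & (OR of snoop bits)]. */
bool offcore_rsp_filter_valid(uint64_t msr_value)
{
    bool req_ok = (msr_value & RSP_REQ_MASK)   != 0;
    bool any    = (msr_value & RSP_ANY_BIT)    != 0;
    bool supp   = (msr_value & RSP_SUPP_MASK)  != 0;
    bool snoop  = (msr_value & RSP_SNOOP_MASK) != 0;
    return req_ok && (any || (supp && snoop));
}
```

For example, a request bit plus ANY is valid, while a request bit plus a supplier bit alone (no snoop bit, no ANY) is not.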
18.3.4.6 Uncore Performance Monitoring Facilities In Intel® Core™ i7-2xxx, Intel® Core™ i5-2xxx,
Intel® Core™ i3-2xxx Processor Series
The uncore sub-system in the Intel® Core™ i7-2xxx, Intel® Core™ i5-2xxx, Intel® Core™ i3-2xxx processor series provides a unified L3 that can support up to four processor cores. The L3 cache consists of multiple slices; each slice interfaces with a processor core via a coherence engine, referred to as a C-Box. Each C-Box provides a dedicated set of MSRs to select uncore performance monitoring events, and each C-Box event select MSR is paired with a counter register, similar in style to those described in Section 18.3.1.2.2. The ARB unit in the uncore also provides its own local performance counters and event select MSRs. The layout of the event select MSRs in the C-Boxes and the ARB unit is shown in Figure 18-32.
Figure 18-32. Layout of Uncore PERFEVTSEL MSR for a C-Box Unit or the ARB Unit (Event Select in bits 7:0, Unit Mask (UMASK) in bits 15:8, control bits in bits 23:16, Counter Mask (CMASK) in bits 28:24; bits 63:29 reserved)
The bit fields of the uncore event select MSRs for a C-box unit or the ARB unit are summarized below:
• Event_Select (bits 7:0) and UMASK (bits 15:8): Specifies the microarchitectural condition to count in a local
uncore PMU counter, see Table 19-20.
• E (bit 18): Enables edge detection filtering, if 1.
• OVF_EN (bit 20): Enables the overflow indicator from the uncore counter forwarded to
MSR_UNC_PERF_GLOBAL_CTRL, if 1.
• EN (bit 22): Enables the local counter associated with this event select MSR.
• INV (bit 23): Event count increments with non-negative value if 0, with negated value if 1.
• CMASK (bits 28:24): Specifies a positive threshold value to filter raw event count input.
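The field summary above can be sketched as an encoder for the C-Box/ARB event select MSR value. The bit positions follow Figure 18-32; the function name and the sample event/umask values used in the usage note are illustrative only.

```c
#include <stdint.h>

/* Encode an uncore PERFEVTSEL value for a C-Box or ARB unit:
 * Event Select 7:0, UMASK 15:8, E bit 18, OVF_EN bit 20, EN bit 22,
 * INV bit 23, CMASK 28:24 (per Figure 18-32). */
uint64_t unc_perfevtsel_encode(uint8_t event, uint8_t umask,
                               int edge, int ovf_en, int inv, uint8_t cmask)
{
    uint64_t v = 0;
    v |= (uint64_t)event;                /* microarchitectural condition */
    v |= (uint64_t)umask << 8;
    if (edge)   v |= 1ULL << 18;         /* edge detection filtering     */
    if (ovf_en) v |= 1ULL << 20;         /* forward overflow indicator   */
    v |= 1ULL << 22;                     /* EN: enable the local counter */
    if (inv)    v |= 1ULL << 23;         /* negate comparison            */
    v |= ((uint64_t)cmask & 0x1F) << 24; /* 5-bit threshold              */
    return v;
}
```

For instance, encoding a hypothetical event 22H with umask 01H and no filtering yields 400122H (EN set, event and umask fields populated).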
At the uncore domain level, there is a master set of control MSRs that centrally manages all the performance monitoring facilities of the uncore units. Figure 18-33 shows the layout of the uncore domain global control.
When an uncore counter overflows, a PMI can be routed to a processor core. Bits 3:0 of
MSR_UNC_PERF_GLOBAL_CTRL can be used to select which processor core to handle the uncore PMI. Software
must then write to bit 13 of IA32_DEBUGCTL (at address 1D9H) to enable this capability.
• PMI_SEL_Core#: Enables the forwarding of an uncore PMI request to a processor core, if 1. If bit 30 (WakePMI)
is ‘1’, a wake request is sent to the respective processor core prior to sending the PMI.
• EN: Enables the fixed uncore counter, the ARB counters, and the CBO counters in the uncore PMU, if 1. This bit
is cleared if bit 31 (FREEZE) is set and any enabled uncore counters overflow.
• WakePMI: Controls sending a wake request to any halted processor core before issuing the uncore PMI request.
If a processor core was halted and not sent a wake request, the uncore PMI will not be serviced by the
processor core.
• FREEZE: Provides the capability to freeze all uncore counters when an overflow condition occurs in a unit
counter. When this bit is set, and a counter overflow occurs, the uncore PMU logic will clear the global enable
bit (bit 29).
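The control bits above can be sketched as a composed write value for the uncore global control MSR. The bit positions follow the field descriptions (PMI_Sel_Core in bits 3:0, EN at bit 29, WakePMI at bit 30, FREEZE at bit 31); the helper names are illustrative.

```c
#include <stdint.h>

/* Bit positions from the MSR_UNC_PERF_GLOBAL_CTRL field descriptions. */
#define UNC_GLOBAL_PMI_SEL_CORE(n) (1ULL << (n)) /* route uncore PMI to core n */
#define UNC_GLOBAL_EN              (1ULL << 29)  /* enable all uncore counters */
#define UNC_GLOBAL_WAKE_PMI        (1ULL << 30)  /* wake halted core first     */
#define UNC_GLOBAL_FREEZE          (1ULL << 31)  /* freeze all on overflow     */

/* Enable all uncore counters, route uncore PMIs to core 0, wake it if
 * halted, and freeze all counters when any enabled counter overflows. */
uint64_t unc_global_ctrl_basic(void)
{
    return UNC_GLOBAL_PMI_SEL_CORE(0) | UNC_GLOBAL_EN |
           UNC_GLOBAL_WAKE_PMI | UNC_GLOBAL_FREEZE;
}
```

Recall from the text that software must also set bit 13 of IA32_DEBUGCTL for the PMI routing to take effect.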
The layout of the uncore domain global control MSR (RESET value — 00000000_00000000H) is:
• Bits 3:0: PMI_Sel_Core0 through PMI_Sel_Core3 — route the uncore PMI to cores 0-3.
• Bit 29: EN — enable all uncore counters.
• Bit 30: WakePMI — wake cores on PMI.
• Bit 31: FREEZE — freeze counters.
All other bits are reserved.
Additionally, there is a fixed counter, counting uncore clockticks, for the uncore domain. Table 18-19 summarizes the number of MSRs for the uncore PMU of each box.
Table 18-21. Uncore PMU MSR Summary for Intel® Xeon® Processor E5 Family

Box     # of Boxes   Counters per Box   Counter Width   General Purpose   Global Enable   Sub-control MSRs
C-Box   8            4                  44              Yes               per-box         None
PCU     1            4                  48              Yes               per-box         Match/Mask
U-Box   1            2                  44              Yes               uncore          None
Details of the uncore performance monitoring facility of the Intel Xeon Processor E5 family are available in the “Intel® Xeon® Processor E5 Uncore Performance Monitoring Programming Reference Manual”. The MSR-based uncore PMU interfaces are listed in Table 2-24.
18.3.5.1 Intel® Xeon® Processor E5 v2 and E7 v2 Family Uncore Performance Monitoring Facility
The uncore subsystems in the Intel Xeon processor E5 v2 and Intel Xeon processor E7 v2 product families are based on the Ivy Bridge-E microarchitecture. There are some similarities with those of the Intel Xeon processor E5 family based on the Sandy Bridge microarchitecture. Within the uncore subsystem, localized performance counter sets are provided at logic control unit scope.
Details of the uncore performance monitoring facility of Intel Xeon Processor E5 v2 and Intel Xeon Processor E7 v2
families are available in “Intel® Xeon® Processor E5 v2 and E7 v2 Uncore Performance Monitoring Programming
Reference Manual”. The MSR-based uncore PMU interfaces are listed in Table 2-28.
NOTE
PEBS events are only valid when the following fields of IA32_PERFEVTSELx are all zero: AnyThread,
Edge, Invert, CMask.
Table 18-24. PEBS Record Format for 4th Generation Intel Core Processor Family
Byte Offset Field Byte Offset Field
00H R/EFLAGS 60H R10
08H R/EIP 68H R11
10H R/EAX 70H R12
18H R/EBX 78H R13
20H R/ECX 80H R14
28H R/EDX 88H R15
30H R/ESI 90H IA32_PERF_GLOBAL_STATUS
38H R/EDI 98H Data Linear Address
40H R/EBP A0H Data Source Encoding
48H R/ESP A8H Latency value (core cycles)
50H R8 B0H EventingIP
58H R9 B8H TX Abort Information (Section 18.3.6.5.1)
The layout of PEBS records is almost identical to that shown in Table 18-3. Offset B0H is a new field that records the eventing IP address of the retired instruction that triggered the PEBS assist.
The PEBS records at offsets 98H, A0H, and A8H record data gathered from three of the PEBS capabilities in prior processor generations: the load latency facility (Section 18.3.4.4.2), PDIR (Section 18.3.4.4.4), and the equivalent capability of precise store in prior generations (see Section 18.3.6.3).
In the core PMU of the 4th generation Intel Core processor, the load latency facility and PDIR capabilities are unchanged. However, precise store is replaced by an enhanced capability, data address profiling, that is not restricted to store addresses. Data address profiling also records information in PEBS records at offsets 98H, A0H, and A8H.
Table 18-25. Precise Events That Support Data Linear Address Profiling

Event Name                                      Event Name
MEM_UOPS_RETIRED.STLB_MISS_LOADS                MEM_UOPS_RETIRED.STLB_MISS_STORES
MEM_UOPS_RETIRED.LOCK_LOADS                     MEM_UOPS_RETIRED.SPLIT_STORES
MEM_UOPS_RETIRED.SPLIT_LOADS                    MEM_UOPS_RETIRED.ALL_STORES
MEM_UOPS_RETIRED.ALL_LOADS                      MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM
MEM_LOAD_UOPS_RETIRED.L1_HIT                    MEM_LOAD_UOPS_RETIRED.L2_HIT
MEM_LOAD_UOPS_RETIRED.L3_HIT                    MEM_LOAD_UOPS_RETIRED.L1_MISS
MEM_LOAD_UOPS_RETIRED.L2_MISS                   MEM_LOAD_UOPS_RETIRED.L3_MISS
MEM_LOAD_UOPS_RETIRED.HIT_LFB                   MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT           MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM
UOPS_RETIRED.ALL (if load or store is tagged)   MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_NONE
DataLA can use any one of the IA32_PMC0-IA32_PMC3 counters. Counter overflows will initiate the generation of PEBS records. Upon counter overflow, hardware captures the linear address and possibly other status information of the retiring memory uop. This information is then written to the PEBS record that is subsequently generated.
To enable the DataLA facility, software must complete the following steps. Please note that the DataLA facility relies
on the PEBS facility, so the PEBS configuration requirements must be completed before attempting to capture
DataLA information.
• Complete the PEBS configuration steps.
• Program an event listed in Table 18-25 using any one of IA32_PERFEVTSEL0-IA32_PERFEVTSEL3.
• Set the corresponding IA32_PEBS_ENABLE.PEBS_EN_CTRx bit. This enables the corresponding IA32_PMCx as
a PEBS counter and enables the DataLA facility.
When the DataLA facility is enabled, the relevant information written into a PEBS record affects entries at offsets
98H, A0H and A8H, as shown in Table 18-26.
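The enabling steps above can be sketched as two register values. The event encoding D0H/81H for MEM_UOPS_RETIRED.ALL_LOADS is taken from the Sandy Bridge listing in Table 18-12 and assumed unchanged here; the USR/OS/EN bit positions are the architectural IA32_PERFEVTSELx layout, and the helper names are illustrative.

```c
#include <stdint.h>

/* Architectural IA32_PERFEVTSELx field positions. */
#define EVTSEL_EVENT(e)  ((uint64_t)(e))
#define EVTSEL_UMASK(u)  ((uint64_t)(u) << 8)
#define EVTSEL_USR       (1ULL << 16)
#define EVTSEL_OS        (1ULL << 17)
#define EVTSEL_EN        (1ULL << 22)

/* Program MEM_UOPS_RETIRED.ALL_LOADS (assumed D0H/81H) on counter 0.
 * AnyThread, Edge, Invert and CMask must remain zero for PEBS events. */
uint64_t datala_evtsel0(void)
{
    return EVTSEL_EVENT(0xD0) | EVTSEL_UMASK(0x81) |
           EVTSEL_USR | EVTSEL_OS | EVTSEL_EN;
}

/* Enable IA32_PMC0 as a PEBS counter, which also enables DataLA. */
uint64_t datala_pebs_enable(void)
{
    return 1ULL << 0; /* IA32_PEBS_ENABLE.PEBS_EN_PMC0 */
}
```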
The supplier information field is listed in Table 18-28. The fields vary across products (according to CPUID signatures), as noted in the descriptions.
Table 18-28. MSR_OFFCORE_RSP_x Supplier Info Field Definition (CPUID Signature 06_3CH, 06_46H)

Subtype         Bit Name   Offset   Description
Common          Any        16       Catch all value for any response types.
Supplier Info   NO_SUPP    17       No Supplier Information available.
                L3_HITM    18       M-state initial lookup state in L3.
                L3_HITE    19       E-state initial lookup state in L3.
                L3_HITS    20       S-state initial lookup state in L3.
                Reserved   21       Reserved.
                LOCAL      22       Local DRAM Controller.
                Reserved   30:23    Reserved.
Table 18-29. MSR_OFFCORE_RSP_x Supplier Info Field Definition (CPUID Signature 06_45H)

Subtype         Bit Name                  Offset   Description
Common          Any                       16       Catch all value for any response types.
Supplier Info   NO_SUPP                   17       No Supplier Information available.
                L3_HITM                   18       M-state initial lookup state in L3.
                L3_HITE                   19       E-state initial lookup state in L3.
                L3_HITS                   20       S-state initial lookup state in L3.
                Reserved                  21       Reserved.
                L4_HIT_LOCAL_L4           22       L4 Cache.
                L4_HIT_REMOTE_HOP0_L4     23       L4 Cache.
                L4_HIT_REMOTE_HOP1_L4     24       L4 Cache.
                L4_HIT_REMOTE_HOP2P_L4    25       L4 Cache.
                Reserved                  30:26    Reserved.
• IN_TX (bit 32): When set, the counter will only include counts that occurred inside a transactional region,
regardless of whether that region was aborted or committed. This bit may only be set if the processor supports
HLE or RTM.
• IN_TXCP (bit 33): When set, the counter will not include counts that occurred inside an aborted transactional region. This bit may only be set if the processor supports HLE or RTM, and may only be set for IA32_PERFEVTSEL2.
When the IA32_PERFEVTSELx MSR is programmed with both IN_TX=0 and IN_TXCP=0 on a processor that
supports Intel TSX, the result in a counter may include detectable conditions associated with a transaction code
region for its aborted execution (if any) and completed execution.
In the initial implementation, software may need to take precautions when using the IN_TXCP bit. See Table 2-29.
The IA32_PERFEVTSELx bit fields (with the Intel TSX extensions) are:
• Bits 7:0: Event Select.
• Bits 15:8: Unit Mask (UMASK).
• Bit 16: USR — user mode.
• Bit 17: OS — operating system mode.
• Bit 18: E — edge detect.
• Bit 19: PC — pin control.
• Bit 20: INT — APIC interrupt enable.
• Bit 21: ANY — any thread.
• Bit 22: EN — enable counters.
• Bit 23: INV — invert counter mask.
• Bits 31:24: Counter Mask (CMASK).
• Bit 32: IN_TX — in transactional region.
• Bit 33: IN_TXCP — in-transaction, exclude aborts (IA32_PERFEVTSEL2 only).
Bits 63:34 are reserved.
A common usage of setting IN_TXCP=1 is to capture the number of events that were discarded due to a transactional abort. With IA32_PMC2 configured to count in such a manner, when a transactional region aborts, the value of that counter is restored to the value it had prior to the aborted transactional region. As a result, any updates performed to the counter during the aborted transactional region are discarded.
On the other hand, setting IN_TX=1 can be used to drill down on the performance characteristics of transactional
code regions. When a PMCx is configured with the corresponding IA32_PERFEVTSELx.IN_TX=1, only eventing
conditions that occur inside transactional code regions are propagated to the event logic and reflected in the
counter result. Eventing conditions specified by IA32_PERFEVTSELx but occurring outside a transactional region
are discarded.
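The two filtering modes above can be sketched as modifications to an event-select value. The bit positions (IN_TX at bit 32, IN_TXCP at bit 33) are from the field list earlier in this section; the function name is illustrative.

```c
#include <stdint.h>

#define EVTSEL_IN_TX    (1ULL << 32)
#define EVTSEL_IN_TXCP  (1ULL << 33) /* only valid on IA32_PERFEVTSEL2 */

/* Add TSX filtering to an existing IA32_PERFEVTSELx value: in_tx counts
 * only inside transactional regions; in_txcp discards counts from
 * aborted transactional regions. */
uint64_t evtsel_with_tsx_filter(uint64_t evtsel, int in_tx, int in_txcp)
{
    if (in_tx)   evtsel |= EVTSEL_IN_TX;
    if (in_txcp) evtsel |= EVTSEL_IN_TXCP;
    return evtsel;
}
```

A caller configuring IA32_PERFEVTSEL2 to discard aborted-region counts would pass in_txcp=1; remember that IN_TXCP is accepted only on that event select register.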
Additionally, a number of performance events are solely focused on characterizing the execution of Intel TSX transactional code; they are listed in Table 19-12.
Pending a PEBS record inside a transactional region will cause a transactional abort. If a PEBS record was pended at the time of the abort, or on an overflow of the TSX PEBS events listed above, only the following PEBS entries will be valid (enumerated by PEBS entry offset B8H bits[33:32], which indicate an HLE abort or an RTM abort):
• Offset B0H: EventingIP,
• Offset B8H: TX Abort Information
These fields are set for all PEBS events.
• Offset 08H (RIP/EIP) corresponds to the instruction following the outermost XACQUIRE in HLE or the first
instruction of the fallback handler of the outermost XBEGIN instruction in RTM. This is useful to identify the
aborted transactional region.
In the case of HLE, an aborted transaction will restart execution deterministically at the start of the HLE region. In
the case of RTM, an aborted transaction will transfer execution to the RTM fallback handler.
The layout of the TX Abort Information field is given in Table 18-31.
18.3.6.6 Uncore Performance Monitoring Facilities in the 4th Generation Intel® Core™ Processors
The uncore sub-system in the 4th Generation Intel® Core™ processors provides its own performance monitoring
facility. The uncore PMU facility provides dedicated MSRs to select uncore performance monitoring events in a
similar manner as those described in Section 18.3.4.6.
The ARB unit and each C-Box provide local pairs of event select MSRs and counter registers. The layout of the event select MSRs in the C-Boxes is identical to that shown in Figure 18-32.
At the uncore domain level, there is a master set of control MSRs that centrally manages all the performance monitoring facilities of the uncore units. Figure 18-33 shows the layout of the uncore domain global control.
Additionally, there is a fixed counter, counting uncore clockticks, for the uncore domain. Table 18-19 summarizes the number of MSRs for the uncore PMU of each box.
The uncore performance events for the C-Box and ARB units are listed in Table 19-13.
18.3.7 5th Generation Intel® Core™ Processor and Intel® Core™ M Processor Performance
Monitoring Facility
The 5th Generation Intel® Core™ processor and the Intel® Core™ M processor families are based on the Broadwell
microarchitecture. The core PMU supports architectural performance monitoring capability with version ID 3 (see
Section 18.2.3) and a host of non-architectural monitoring capabilities.
Architectural performance monitoring version 3 capabilities are described in Section 18.2.3.
The core PMU has the same capability as that described in Section 18.3.6. IA32_PERF_GLOBAL_STATUS provides a bit indicator (bit 55) that allows the PMI handler to distinguish a PMI caused by an overflow of the output buffer that accumulates packet data from Intel Processor Trace.
The IA32_PERF_GLOBAL_STATUS bit fields are:
• Bits 7:0: PMC0_OVF through PMC7_OVF (RO; bits 7:4 are valid only if CPUID.0AH:EAX[15:8] = 8, i.e., if PMC4-PMC7 are present; otherwise reserved).
• Bits 34:32: FIXED_CTR0, FIXED_CTR1, FIXED_CTR2 Overflow (RO).
• Bit 55: Trace_ToPA_PMI.
• Bit 61: Ovf_UncorePMU.
• Bit 62: Ovf_Buffer.
• Bit 63: CondChgd.
All other bits are reserved.
Details of Intel Processor Trace are described in Chapter 35, “Intel® Processor Trace”.
The IA32_PERF_GLOBAL_OVF_CTRL MSR provides a corresponding reset control bit.
The IA32_PERF_GLOBAL_OVF_CTRL bit fields are:
• Bits 7:0: PMC0_ClrOvf through PMC7_ClrOvf (bits 7:4 valid only if PMC4-PMC7 are present; otherwise reserved).
• Bits 34:32: FIXED_CTR0, FIXED_CTR1, FIXED_CTR2 ClrOverflow.
• Bit 55: ClrTraceToPA_PMI.
• Bit 61: ClrOvfUncore.
• Bit 62: ClrOvfDSBuffer.
• Bit 63: ClrCondChgd.
All other bits are reserved.
The specifics of non-architectural performance events are listed in Chapter 19, “Performance Monitoring Events”.
18.3.8 6th Generation, 7th Generation and 8th Generation Intel® Core™ Processor
Performance Monitoring Facility
The 6th generation Intel® Core™ processor is based on the Skylake microarchitecture. The 7th generation Intel®
Core™ processor is based on the Kaby Lake microarchitecture. The 8th generation Intel® Core™ processor is based
on the Coffee Lake microarchitecture. For these microarchitectures, the core PMU supports architectural performance monitoring capability with version ID 4 (see Section 18.2.4) and a host of non-architectural monitoring capabilities.
Architectural performance monitoring version 4 capabilities are described in Section 18.2.4.
The core PMU’s capability is similar to those described in Section 18.3.4.5 through Section 18.3.6, with some differences and enhancements summarized in Table 18-22. Additionally, the core PMU provides some enhancements to support performance monitoring when the target workload contains instruction streams using Intel® Transactional Synchronization Extensions (TSX); see Section 18.3.6.5. For details of Intel TSX, see Chapter 16, “Programming with Intel® Transactional Synchronization Extensions” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
Performance monitoring results may be affected by side-band activity on processors that support Intel SGX; details are described in Chapter 42, “Enclave Code Debug and Profiling”.
Counter and Buffer Overflow Status Management:
• 6th/7th/8th generation: query via IA32_PERF_GLOBAL_STATUS; reset via IA32_PERF_GLOBAL_STATUS_RESET; set via IA32_PERF_GLOBAL_STATUS_SET.
• Prior generations: query via IA32_PERF_GLOBAL_STATUS; reset via IA32_PERF_GLOBAL_OVF_CTRL.
• See Section 18.2.4.
IA32_PERF_GLOBAL_STATUS Indicators of Overflow/Overhead/Interference:
• 6th/7th/8th generation: individual counter overflow; PEBS buffer overflow; ToPA buffer overflow; CTR_Frz, LBR_Frz, ASCI.
• Prior generations: individual counter overflow; PEBS buffer overflow; ToPA buffer overflow (applicable to Broadwell microarchitecture).
• See Section 18.2.4.
Enable control in IA32_PERF_GLOBAL_STATUS:
• 6th/7th/8th generation: CTR_Frz; LBR_Frz.
• Prior generations: NA.
• See Section 18.2.4.1.
Perfmon Counter In-Use Query Indicator:
• 6th/7th/8th generation: IA32_PERF_GLOBAL_INUSE.
• Prior generations: NA.
• See Section 18.2.4.3.
NOTES
Precise events are only valid when the following fields of IA32_PERFEVTSELx are all zero:
AnyThread, Edge, Invert, CMask.
Table 18-35. PEBS Record Format for 6th Generation, 7th Generation
and 8th Generation Intel Core Processor Families
Byte Offset Field Byte Offset Field
00H R/EFLAGS 68H R11
08H R/EIP 70H R12
10H R/EAX 78H R13
18H R/EBX 80H R14
20H R/ECX 88H R15
28H R/EDX 90H Applicable Counter
30H R/ESI 98H Data Linear Address
38H R/EDI A0H Data Source Encoding
40H R/EBP A8H Latency value (core cycles)
48H R/ESP B0H EventingIP
50H R8 B8H TX Abort Information (Section 18.3.6.5.1)
58H R9 C0H TSC
60H R10
The layout of PEBS records is largely identical to that shown in Table 18-24.
The PEBS records at offsets 98H, A0H, and A8H record data gathered from three of the PEBS capabilities in prior processor generations: the load latency facility (Section 18.3.4.4.2), PDIR (Section 18.3.4.4.4), and data address profiling (Section 18.3.6.3).
In the core PMU of the 6th generation, 7th generation and 8th generation Intel Core processors, load latency facility
and PDIR capabilities and data address profiling are unchanged relative to the 4th generation and 5th generation
Intel Core processors. Similarly, precise store is replaced by data address profiling.
With format 0010b, a snapshot of IA32_PERF_GLOBAL_STATUS may be useful to resolve situations in which more than one of the IA32_PMCx counters has been configured to collect PEBS data and two consecutive overflows of the PEBS-enabled counters are sufficiently far apart in time. It is also possible for the image at offset 90H to indicate that multiple PEBS-enabled counters have overflowed. In the latter scenario, software cannot correlate the PEBS record entry to the multiple overflowed bits.
With PEBS record format encoding 0011b, offset 90H reports the “applicable counter” field, which is a multi-counter PEBS resolution index allowing software to correlate the PEBS record entry with the eventing PEBS overflow when multiple counters are configured to record PEBS records. Additionally, offset C0H captures a snapshot of the TSC that provides a timeline annotation for each PEBS record entry.
Table 18-36. Precise Events for the Skylake, Kaby Lake and Coffee Lake Microarchitectures

Event Name                         Event Select   Sub-event                      UMask
INST_RETIRED                       C0H            PREC_DIST (see note 1)         01H
                                                  ALL_CYCLES (see note 2)        01H
OTHER_ASSISTS                      C1H            ANY                            3FH
BR_INST_RETIRED                    C4H            CONDITIONAL                    01H
                                                  NEAR_CALL                      02H
                                                  ALL_BRANCHES                   04H
                                                  NEAR_RETURN                    08H
                                                  NEAR_TAKEN                     20H
                                                  FAR_BRANCHES                   40H
BR_MISP_RETIRED                    C5H            CONDITIONAL                    01H
                                                  ALL_BRANCHES                   04H
                                                  NEAR_TAKEN                     20H
FRONTEND_RETIRED                   C6H            <Programmable> (see note 3)    01H
HLE_RETIRED                        C8H            ABORTED                        04H
RTM_RETIRED                        C9H            ABORTED                        04H
MEM_INST_RETIRED (see note 2)      D0H            LOCK_LOADS                     21H
                                                  SPLIT_LOADS                    41H
                                                  SPLIT_STORES                   42H
                                                  ALL_LOADS                      81H
                                                  ALL_STORES                     82H
MEM_LOAD_RETIRED (see note 4)      D1H            L1_HIT                         01H
                                                  L2_HIT                         02H
                                                  L3_HIT                         04H
                                                  L1_MISS                        08H
                                                  L2_MISS                        10H
                                                  L3_MISS                        20H
                                                  HIT_LFB                        40H
MEM_LOAD_L3_HIT_RETIRED (note 2)   D2H            XSNP_MISS                      01H
                                                  XSNP_HIT                       02H
                                                  XSNP_HITM                      04H
                                                  XSNP_NONE                      08H
NOTES:
1. Only available on IA32_PMC1.
2. INST_RETIRED.ALL_CYCLES is configured with additional parameters of CMask = 10 and INV = 1.
3. Sub-events are specified using MSR_PEBS_FRONTEND, see Section 18.3.8.2.
4. Instruction with at least one load uop experiencing the condition specified in the UMask.
18.3.8.1.5 FRONTEND_RETIRED
The FRONTEND_RETIRED event is designed to help software developers identify the exact instructions that caused front-end issues. There are some instances in which the event will, by design, under-count; the under-counting scenarios include the following:
• The event counts only retired (non-speculative) front-end events; i.e., only events from the true program execution path are counted.
• The event will count once per cacheline (at most). If a cacheline contains multiple instructions which caused
front-end misses, the count will be only 1 for that line.
• If the multibyte sequence of an instruction spans two cachelines and causes a miss, it will be recorded once. If there were additional misses in the second cacheline, they will not be counted separately.
• If a multi-uop instruction exceeds the allocation width of one cycle, the bubbles associated with these uops will
be counted once per that instruction.
• If 2 instructions are fused (macro-fusion), and either of them or both cause front-end misses, it will be counted
once for the fused instruction.
• If a front-end (miss) event occurs outside instruction boundary (e.g. due to processor handling of architectural
event), it may be reported for the next instruction to retire.
Table 18-41 lists the supplier information field that applies to 6th generation, 7th generation and 8th generation Intel Core processors (6th generation Intel Core processor CPUID signatures: 06_4EH, 06_5EH; 7th generation and 8th generation Intel Core processor CPUID signatures: 06_8EH, 06_9EH).
Table 18-42 lists the snoop information field that applies to processors with CPUID signatures 06_4EH, 06_5EH, 06_8EH, 06_9EH, and 06_55H.
18.3.8.2.1 Off-core Response Performance Monitoring for the Intel® Xeon® Processor Scalable Family
The following tables list the requestor and supplier information fields that apply to the Intel® Xeon® Processor
Scalable Family.
• Transaction request type encoding (bits 15:0): see Table 18-43.
• Supplier information (bits 29:16): see Table 18-44.
• Supplier information (bits 29:16) with support for Intel® Optane™ DC persistent memory: see Table 18-45.
• Snoop response information (bits 37:30) has not changed: see Table 18-42.
Table 18-43. MSR_OFFCORE_RSP_x Request_Type Definition (Intel® Xeon® Processor Scalable Family)

Bit Name          Offset   Description
DEMAND_DATA_RD    0        Counts the number of demand data reads and page table entry cacheline reads. Does not count HW or SW prefetches.
DEMAND_RFO        1        Counts the number of demand reads for ownership (RFO) requests generated by a write to a data cacheline. Does not count L2 RFO prefetches.
DEMAND_CODE_RD    2        Counts the number of demand instruction cacheline reads and L1 instruction cacheline prefetches.
Reserved          3        Reserved.
PF_L2_DATA_RD     4        Counts the number of prefetch data reads into L2.
PF_L2_RFO         5        Counts the number of RFO requests generated by the MLC prefetches to L2.
Reserved          6        Reserved.
PF_L3_DATA_RD     7        Counts the number of MLC data read prefetches into L3.
PF_L3_RFO         8        Counts the number of RFO requests generated by MLC prefetches to L3.
Reserved          9        Reserved.
PF_L1D_AND_SW     10       Counts data cacheline reads generated by the hardware L1 data cache prefetcher or software prefetch requests.
Reserved          14:11    Reserved.
OTHER             15       Counts miscellaneous requests, such as I/O and uncacheable accesses.
Table 18-44 lists the supplier information field that applies to the Intel Xeon Processor Scalable Family (CPUID
signature: 06_55H).
Table 18-44. MSR_OFFCORE_RSP_x Supplier Info Field Definition (CPUID Signature 06_55H)

Subtype         Bit Name                     Offset   Description
Common          Any                          16       Catch all value for any response types.
Supplier Info   SUPPLIER_NONE                17       No Supplier Information available.
                L3_HIT_M                     18       M-state initial lookup state in L3.
                L3_HIT_E                     19       E-state initial lookup state in L3.
                L3_HIT_S                     20       S-state initial lookup state in L3.
                L3_HIT_F                     21       F-state initial lookup state in L3.
                Reserved                     25:22    Reserved.
                L3_MISS_LOCAL_DRAM           26       L3 Miss: local home requests that missed the L3 cache and were serviced by local DRAM.
                L3_MISS_REMOTE_HOP0_DRAM     27       Hop 0 Remote supplier.
                L3_MISS_REMOTE_HOP1_DRAM     28       Hop 1 Remote supplier.
                L3_MISS_REMOTE_HOP2P_DRAM    29       Hop 2 or more Remote supplier.
                Reserved                     30       Reserved.
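Composing a full filter from these tables can be sketched as follows. The request bit comes from Table 18-43 and the supplier bit from Table 18-44; the snoop bits (37:30, per Table 18-42) are not reproduced in this excerpt, so the sketch takes them as a caller-supplied parameter rather than inventing positions.

```c
#include <stdint.h>

/* Request-type bit from Table 18-43 and supplier bit from Table 18-44. */
#define REQ_DEMAND_DATA_RD      (1ULL << 0)
#define SUP_L3_MISS_LOCAL_DRAM  (1ULL << 26)

/* Build an MSR_OFFCORE_RSP_x value counting demand data reads that missed
 * L3 and were serviced by local DRAM. snoop_bits must be ORed in from
 * Table 18-42 (bits 37:30, not shown in this excerpt) to complete the
 * supplier & snoop expression. */
uint64_t offcore_l3_miss_local_dram(uint64_t snoop_bits)
{
    return REQ_DEMAND_DATA_RD | SUP_L3_MISS_LOCAL_DRAM | snoop_bits;
}
```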
Table 18-45 lists the supplier information field that applies to the Intel Xeon Processor Scalable Family (CPUID signature: 06_55H, steppings 5H-FH).
18.3.8.3 Uncore Performance Monitoring Facilities on Intel® Core™ Processors Based on Cannon Lake
Microarchitecture
Cannon Lake microarchitecture introduces LLC support of up to six processor cores. To support six processor cores and eight LLC slices, existing MSRs have been rearranged and new CBo MSRs have been added. Uncore performance monitoring software drivers from prior generations of Intel Core processors will need to update the MSR addresses. The new MSRs and updated MSR addresses have been added to the Uncore PMU listing in Section 2.17.2, “MSRs Specific to 8th Generation Intel® Core™ i3 Processors” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4.
Ice Lake microarchitecture adds a new category of response subtype, called Combined Response Info. To count a feature of this type, all the specified bits must be set to 1.
A valid response type must be a non-zero value of the following expression:
ANY | ['OR' of Combined Response Info Bits | [('OR' of Supplier Info Bits) & ('OR' of Snoop Info Bits)]]
If the "ANY" bit (bit 16) is set, the other response type bits [39:17] are ignored.
Table 18-48 lists the supplier information field that applies to processors based on Ice Lake microarchitecture.
Table 18-49 lists the snoop information field that applies to processors based on Ice Lake microarchitecture.
[Figure: MSR_PERF_METRICS layout — four 8-bit metric fields in bits 31:0; bits 63:32 reserved]
MSR_PERF_METRICS
Address: 329H
Scope: Thread
Reset value: 0x00000000.00000000
This register exposes the four TMA Level 1 metrics. The lower 32 bits are divided into four 8-bit fields, each of which
is an integer fraction of 255.
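For illustration, the decode below splits the lower 32 bits into the four 8-bit fields; the metric names and their field order are assumptions, not taken from the text above.

```python
def decode_perf_metrics(msr_value):
    """Split the lower 32 bits of MSR_PERF_METRICS (address 329H) into four
    8-bit fields and scale each as an integer fraction of 255."""
    # Metric names and their order are illustrative assumptions.
    names = ("Retiring", "Bad Speculation", "Frontend Bound", "Backend Bound")
    return {names[i]: ((msr_value >> (8 * i)) & 0xFF) / 255 for i in range(4)}
```

For example, a raw value with 0xFF in the lowest byte decodes to 1.0 for the first field.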
To support built-in performance metrics, new bits have been added to the following MSRs:
• IA32_PERF_GLOBAL_CTRL.EN_PERF_METRICS[48]: If this bit is set and fixed counter 3 is effectively enabled,
built-in performance metrics are enabled.
• IA32_PERF_GLOBAL_STATUS_SET.SET_OVF_PERF_METRICS[48]: If this bit is set, it sets the status bit in the
IA32_PERF_GLOBAL_STATUS register for the PERF_METRICS counters.
• IA32_PERF_GLOBAL_STATUS_RESET.RESET_OVF_PERF_METRICS[48]: If this bit is set, it clears the status bit
in the IA32_PERF_GLOBAL_STATUS register for the PERF_METRICS counters.
• IA32_PERF_GLOBAL_STATUS.OVF_PERF_METRICS[48]: If this bit is set, it indicates that the PERF_METRICS
counter has overflowed and a PMI is triggered; however, an overflow of fixed counter 3 should normally happen
first. If this bit is clear, no overflow occurred.
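A minimal sketch of using the bit-48 fields above; only the bit position comes from the list, while the MSR address passed to the hypothetical `writemsr` callback is an assumption for illustration.

```python
OVF_PERF_METRICS_BIT = 48

def perf_metrics_overflowed(global_status):
    """Test IA32_PERF_GLOBAL_STATUS.OVF_PERF_METRICS[48]; an overflow of
    fixed counter 3 should normally be observed first."""
    return bool(global_status & (1 << OVF_PERF_METRICS_BIT))

def clear_perf_metrics_overflow(writemsr):
    """Clear the sticky status bit by writing bit 48 of
    IA32_PERF_GLOBAL_STATUS_RESET (MSR address assumed here as 390H)."""
    IA32_PERF_GLOBAL_STATUS_RESET = 0x390  # assumed address, for illustration
    writemsr(IA32_PERF_GLOBAL_STATUS_RESET, 1 << OVF_PERF_METRICS_BIT)
```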
18.4.1.1 Enhancements of Performance Monitoring in the Intel® Xeon Phi™ Processor Tile
The Intel® Xeon Phi™ processor tile includes the following enhancements to the Silvermont microarchitecture.
• AnyThread support. This facility is limited to the following three architectural events: Instructions Retired,
Unhalted Core Cycles, and Unhalted Reference Cycles using IA32_FIXED_CTR0-2, and Unhalted Core Cycles and
Unhalted Reference Cycles using IA32_PERFEVTSELx.
• PEBS-DLA (Processor Event-Based Sampling-Data Linear Address) fields. The processor provides the memory
address in addition to the Silvermont PEBS record support on select events. The PEBS recording format as
reported by IA32_PERF_CAPABILITIES[11:8] is 2.
• Off-core response counting facility. This facility in the processor core allows software to count certain
transaction responses between the processor tile and subsystems outside the tile (untile). Counting off-core
responses requires an additional event qualification configuration facility used in conjunction with
IA32_PERFEVTSELx. Two off-core response MSRs are provided for use in conjunction with specific event codes
that must be specified with IA32_PERFEVTSELx. The off-core response MSRs are not shared between cores.
Knights Landing expands off-core response capability to match the processor untile changes.
• Average request latency measurement. The off-core response counting facility can use two performance
counters in combination to count the occurrences and weighted cycles of transaction requests. This facility is
updated to match the processor untile changes.
Table 18-50. PEBS Performance Events for the Knights Landing Microarchitecture
Event Name Event Select Sub-event UMask Data Linear Address Support
BR_INST_RETIRED C4H ALL_BRANCHES 00H No
JCC 7EH No
TAKEN_JCC FEH No
CALL F9H No
REL_CALL FDH No
IND_CALL FBH No
NON_RETURN_IND EBH No
FAR_BRANCH BFH No
RETURN F7H No
BR_MISP_RETIRED C5H ALL_BRANCHES 00H No
JCC 7EH No
TAKEN_JCC FEH No
IND_CALL FBH No
NON_RETURN_IND EBH No
RETURN F7H No
MEM_UOPS_RETIRED 04H L2_HIT_LOADS 02H Yes
L2_MISS_LOADS 04H Yes
DTLB_MISS_LOADS 08H Yes
RECYCLEQ 03H LD_BLOCK_ST_FORWARD 01H Yes
LD_SPLITS 08H Yes
The PEBS record format 2 supported by processors based on the Knights Landing microarchitecture is shown in
Table 18-51, and each field in the PEBS record is 64 bits long.
Table 18-51. PEBS Record Format for the Knights Landing Microarchitecture
Byte Offset Field Byte Offset Field
00H R/EFLAGS 60H R10
08H R/EIP 68H R11
10H R/EAX 70H R12
18H R/EBX 78H R13
20H R/ECX 80H R14
28H R/EDX 88H R15
30H R/ESI 90H IA32_PERF_GLOBAL_STATUS
38H R/EDI 98H PSDLA
40H R/EBP A0H Reserved
48H R/ESP A8H Reserved
50H R8 B0H EventingRIP
58H R9 B8H Reserved
Some of the MSR_OFFCORE_RESP[0,1] register bits are not valid in this processor and their use is reserved. The
layout of the MSR_OFFCORE_RSP0 and MSR_OFFCORE_RSP1 registers is defined in Table 18-53. Bits 15:0 specify
the request type of a transaction request to the uncore; bits 30:16 specify supplier information; bits 37:31 specify
snoop response information.
Additionally, MSR_OFFCORE_RSP0 provides bit 38 to enable measurement of the average latency of specific types
of offcore transaction requests using two programmable counters simultaneously; see Section 18.5.2.3 for details.
NOTE
The number of counters available to software may vary from the number of physical counters
present on the hardware, because an agent running at a higher privilege level (e.g., a VMM) may
not expose all counters. CPUID.0AH:EAX[15:8] reports the number of counters available to
software; see Section 18.2.1.
Non-architectural performance monitoring in the Intel Atom processor family uses the IA32_PERFEVTSELx MSR to
configure a set of non-architectural performance monitoring events to be counted by the corresponding general-
purpose performance counter. The non-architectural performance monitoring events are listed in Table 19-31.
Architectural and non-architectural performance monitoring events in 45 nm and 32 nm Intel Atom processors
support thread qualification using bit 21 (AnyThread) of the IA32_PERFEVTSELx MSR; i.e., if
IA32_PERFEVTSELx.AnyThread = 1, event counts include monitored conditions due to either logical processor in
the same processor core.
The bit fields within each IA32_PERFEVTSELx MSR are defined in Figure 18-6 and described in Section 18.2.1.1 and
Section 18.2.3.
Valid event mask (Umask) bits are listed in Chapter 19. The UMASK field may contain sub-fields that provide the
same qualifying actions as those listed in Table 18-71, Table 18-72, Table 18-73, and Table 18-74. One or more of
these sub-fields may apply to specific events on an event-by-event basis. Details are listed in Table 19-31 in
Chapter 19, “Performance Monitoring Events.” Precise Event Based Monitoring is supported using IA32_PMC0 (see
also Section 17.4.9, “BTS and DS Save Area”).
Table 18-54. PEBS Performance Events for the Silvermont Microarchitecture (Contd.)
Event Name Event Select Sub-event UMask
IND_CALL FBH
NON_RETURN_IND EBH
FAR_BRANCH BFH
RETURN F7H
BR_MISP_RETIRED C5H ALL_BRANCHES 00H
JCC 7EH
TAKEN_JCC FEH
IND_CALL FBH
NON_RETURN_IND EBH
RETURN F7H
MEM_UOPS_RETIRED 04H L2_HIT_LOADS 02H
L2_MISS_LOADS 04H
DTLB_MISS_LOADS 08H
HITM 20H
REHABQ 03H LD_BLOCK_ST_FORWARD 01H
LD_SPLITS 08H
PEBS Record Format
The PEBS record format supported by processors based on the Intel Silvermont microarchitecture is shown in
Table 18-55, and each field in the PEBS record is 64 bits long.
The layouts of MSR_OFFCORE_RSP0 and MSR_OFFCORE_RSP1 are shown in Figure 18-38 and Figure 18-39. Bits
15:0 specify the request type of a transaction request to the uncore; bits 30:16 specify supplier information;
bits 37:31 specify snoop response information.
Additionally, MSR_OFFCORE_RSP0 provides bit 38 to enable measurement of the average latency of specific types
of offcore transaction requests using two programmable counters simultaneously; see Section 18.5.2.3 for details.
[Figure 18-38: MSR_OFFCORE_RSPx request-type fields, bits 15:0; bits 63:38 reserved]
[Figure 18-39: MSR_OFFCORE_RSPx response supplier and snoop info fields, bits 38:16]
To properly program this extra register, software must set at least one request type bit (Table 18-57) and a valid
response type pattern (Table 18-58, Table 18-59). Otherwise, the event count reported will be zero. It is permis-
sible and useful to set multiple request and response type bits in order to obtain various classes of off-core
response events. Although the MSR_OFFCORE_RSPx registers allow software to program numerous combinations
that meet the above guideline, not all combinations produce meaningful data.
To specify a complete offcore response filter, software must properly program bits in the request and response type
fields. A valid request type must have at least one bit set in the non-reserved bits of 15:0. A valid response type
must be a non-zero value of the following expression:
ANY | [(‘OR’ of Supplier Info Bits) & (‘OR’ of Snoop Info Bits)]
If “ANY” bit is set, the supplier and snoop info bits are ignored.
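A hedged sketch of this check for the layout described above (field boundaries from the text: request bits 15:0, supplier bits 30:16, snoop bits 37:31; the exact position of the "ANY" bit within the supplier field is an assumption here):

```python
REQUEST_MASK  = 0x0000FFFF    # bits 15:0  - request type
SUPPLIER_MASK = 0x7FFF0000    # bits 30:16 - supplier info
SNOOP_MASK    = 0x3F80000000  # bits 37:31 - snoop response info
ANY_BIT       = 1 << 16       # assumed position of the "ANY" bit

def offcore_rsp_is_valid(value):
    """Valid iff at least one request bit is set and the response type is a
    non-zero value of: ANY | [(OR of Supplier bits) & (OR of Snoop bits)]."""
    if not value & REQUEST_MASK:
        return False  # no request type selected: event count will be zero
    if value & ANY_BIT:
        return True   # supplier and snoop info bits are ignored
    return bool(value & SUPPLIER_MASK) and bool(value & SNOOP_MASK)
```

Setting a supplier bit without any snoop bit fails the check, mirroring the expression above.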
Table 18-60. Core PMU Comparison Between the Goldmont and Silvermont Microarchitectures
• # of fixed counters per core — Goldmont: 3; Silvermont: 3. Use CPUID to determine # of counters; see Section 18.2.1.
• # of general-purpose counters per core — Goldmont: 4; Silvermont: 2. Use CPUID to determine # of counters; see Section 18.2.1.
• Counter width (R,W) — Goldmont: R:48, W:32/48; Silvermont: R:40, W:32. See Section 18.2.2.
• Architectural Performance Monitoring version ID — Goldmont: 4; Silvermont: 3. Use CPUID to determine # of counters; see Section 18.2.1.
• PMI Overhead Mitigation — Goldmont: Freeze_Perfmon_on_PMI with streamlined semantics, and Freeze_LBR_on_PMI with streamlined semantics for branch profiling; Silvermont: Freeze_Perfmon_on_PMI with legacy semantics, and Freeze_LBR_on_PMI with legacy semantics for branch profiling. See Section 17.4.7; legacy semantics are not supported with version 4 or higher.
• Counter and Buffer Overflow Status Management — Goldmont: query via IA32_PERF_GLOBAL_STATUS, reset via IA32_PERF_GLOBAL_STATUS_RESET, set via IA32_PERF_GLOBAL_STATUS_SET; Silvermont: query via IA32_PERF_GLOBAL_STATUS, reset via IA32_PERF_GLOBAL_OVF_CTRL. See Section 18.2.4.
• IA32_PERF_GLOBAL_STATUS Indicators of Overflow/Overhead/Interference — Goldmont: individual counter overflow, PEBS buffer overflow, ToPA buffer overflow, CTR_Frz, LBR_Frz; Silvermont: individual counter overflow, PEBS buffer overflow. See Section 18.2.4.
• Processor Event Based Sampling (PEBS) Events — Goldmont: General-Purpose Counter 0 only; supports all events (precise and non-precise); precise events are listed in Table 18-61; see Section 18.5.2.1.1. Silvermont: General-Purpose Counter 0 only; supports only precise events (see Table 18-54). Comment: IA32_PMC0 only.
• PEBS record format encoding — Goldmont: 0011b; Silvermont: 0010b.
• Reduce skid PEBS — Goldmont: IA32_PMC0 only; Silvermont: No.
• Data Address Profiling — Goldmont: Yes; Silvermont: No.
• PEBS record layout — Goldmont: Table 18-62, with enhanced fields at offsets 90H-98H and a TSC record field at C0H; Silvermont: Table 18-55.
• PEBS EventingIP — Goldmont: Yes; Silvermont: Yes.
• Off-core Response Event — Goldmont: MSR 1A6H and 1A7H, each core has its own register; Silvermont: MSR 1A6H and 1A7H, shared by a pair of cores. Nehalem supports 1A6H only.
The PEBS record format supported by processors based on the Intel Goldmont microarchitecture is shown in
Table 18-62, and each field in the PEBS record is 64 bits long.
On Goldmont microarchitecture, all 64 bits of the architectural registers are written into the PEBS record regardless
of processor mode.
With PEBS record format encoding 0011b, offset 90H reports the "Applicable Counter" field, which indicates which
counters actually requested generation of a PEBS record. This allows software to correlate the PEBS record entry
properly with the instruction that caused the event, even when multiple counters are configured to record PEBS
records and multiple bits are set in the field. Additionally, offset C0H captures a snapshot of the TSC that provides
a timeline annotation for each PEBS record entry.
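As a hedged sketch, the two extra fields can be decoded from a raw record buffer like this (64-bit little-endian fields assumed):

```python
import struct

def decode_pebs_extras(record):
    """For PEBS record format 0011b: read the 'Applicable Counter' bitmap at
    offset 90H and the TSC snapshot at offset C0H of a raw PEBS record."""
    applicable, = struct.unpack_from("<Q", record, 0x90)
    tsc, = struct.unpack_from("<Q", record, 0xC0)
    # List the counter indices whose bits are set in the bitmap.
    counters = [i for i in range(64) if applicable & (1 << i)]
    return counters, tsc
```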
• Bits 15:0 specify the request type of a transaction request to the uncore. This is described in Table 18-63.
• Bits 30:16 specify common supplier information or an L2 hit, as described in Table 18-58.
• If the L2 misses, bits 37:31 can be used to specify snoop response information, as described in Table 18-64.
• For outstanding requests, bit 38 can enable measurement of the average latency of specific types of offcore
transaction requests using two programmable counters simultaneously; see Section 18.5.2.3 for details.
To properly program this extra register, software must set at least one request type bit (Table 18-57) and a valid
response type pattern (either Table 18-58 or Table 18-64). Otherwise, the event count reported will be zero. It is
permissible and useful to set multiple request and response type bits in order to obtain various classes of off-core
response events. Although the MSR_OFFCORE_RSPx registers allow software to program numerous combinations
that meet the above guideline, not all combinations produce meaningful data.
To specify a complete offcore response filter, software must properly program bits in the request and response type
fields. A valid request type must have at least one bit set in the non-reserved bits of 15:0. A valid response type
must be a non-zero value of the following expression:
Any_Response Bit | L2 Hit | ‘OR’ of Snoop Info Bits | Outstanding Bit
Table 18-65. Core PMU Comparison Between the Goldmont Plus and Goldmont Microarchitectures
• # of fixed counters per core — Goldmont Plus: 3; Goldmont: 3. Use CPUID to determine # of counters; see Section 18.2.1.
• # of general-purpose counters per core — Goldmont Plus: 4; Goldmont: 4. Use CPUID to determine # of counters; see Section 18.2.1.
• Counter width (R,W) — Goldmont Plus: R:48, W:32/48; Goldmont: R:48, W:32/48. No change.
• Architectural Performance Monitoring version ID — Goldmont Plus: 4; Goldmont: 4. No change.
• Processor Event Based Sampling (PEBS) Events — Goldmont Plus: all general-purpose and fixed counters; each general-purpose counter supports all events (precise and non-precise). Goldmont: General-Purpose Counter 0 only; supports all events (precise and non-precise); precise events are listed in Table 18-61. Comment: Goldmont Plus supports PEBS on all counters.
• PEBS record format encoding — Goldmont Plus: 0011b; Goldmont: 0011b. No change.
Table 18-66. Core PMU Comparison Between the Tremont and Goldmont Plus Microarchitectures
• # of fixed counters per core — Tremont: 3; Goldmont Plus: 3. Use CPUID to determine # of counters; see Section 18.2.1.
• # of general-purpose counters per core — Tremont: 4; Goldmont Plus: 4. Use CPUID to determine # of counters; see Section 18.2.1.
• Counter width (R,W) — Tremont: R:48, W:32/48; Goldmont Plus: R:48, W:32/48. No change; see Section 18.2.2.
• Architectural Performance Monitoring version ID — Tremont: 5; Goldmont Plus: 4.
• PEBS record format encoding — Tremont: 0100b; Goldmont Plus: 0011b. See Section 18.6.2.4.2.
• Reduce skid PEBS — Tremont: IA32_PMC0 and IA32_FIXED_CTR0; Goldmont Plus: IA32_PMC0 only.
• Extended PEBS — Tremont: Yes; Goldmont Plus: Yes. See Section 18.5.4.1.
• Adaptive PEBS — Tremont: Yes; Goldmont Plus: No. See Section 18.9.2.
• PEBS output — Tremont: DS Save Area or Intel® Processor Trace; Goldmont Plus: DS Save Area only. See Section 18.5.5.2.1.
• PEBS record layout — Tremont: see Section 18.9.2.3 for output to DS, Section 18.5.5.2.2 for output to Intel PT. Goldmont Plus: Table 18-62, with enhanced fields at offsets 90H-98H and a TSC record field at C0H.
• Off-core Response Event — Tremont: MSR 1A6H and 1A7H, each core has its own register, with extended request and response types. Goldmont Plus: MSR 1A6H and 1A7H, each core has its own register.
When PEBS_OUTPUT is set to 01B, the DS Management Area is not used and need not be configured. Instead, the
output mechanism is configured through IA32_RTIT_CTL and other Intel PT MSRs, while counter reload values are
configured in the MSR_RELOAD_PMCx MSRs. Details on configuring Intel PT can be found in Section 35.2.6.
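A sketch of assembling such an IA32_PEBS_ENABLE value: the counter-enable positions (bits m:0 for general-purpose counters, bits starting at 32 for fixed counters) follow the layout in Figure 18-40, while the exact position of the PEBS_OUTPUT field (bits 62:61 here) is an assumption.

```python
PEBS_OUTPUT_SHIFT = 61  # PEBS_OUTPUT field position (assumed: bits 62:61)
PEBS_OUTPUT_PT = 0b01   # 01B routes PEBS records to Intel Processor Trace

def pebs_enable_for_pt(gp_counters=(0,), fixed_counters=()):
    """Build an IA32_PEBS_ENABLE value that selects output to Intel PT;
    counter reload values would be programmed in the MSR_RELOAD_PMCx MSRs."""
    value = PEBS_OUTPUT_PT << PEBS_OUTPUT_SHIFT
    for m in gp_counters:
        value |= 1 << m         # PEBS_EN_PMCm
    for n in fixed_counters:
        value |= 1 << (32 + n)  # PEBS_EN_FIXEDn
    return value
```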
[Figure body: field labels — PEBS_EN_PMC0..PEBS_EN_PMCm (R/W), PEBS_EN_FIXED0..PEBS_EN_FIXEDn (R/W, starting at bit 32), PMI_AFTER_EACH_RECORD (R/W), PEBS_OUTPUT (R/W); remaining bits reserved]
Figure 18-40. IA32_PEBS_ENABLE MSR with PEBS Output to Intel® Processor Trace
Software must guarantee that all other ResponseType bits are set to 0 when the “Outstanding Requests” bit is
set.
• “Outstanding Requests” bit 63 can enable measurement of the average latency of a specific type of off-core
transaction; two programmable counters must be used simultaneously and the RequestType programming for
MSR_OFFCORE_RSP0 and MSR_OFFCORE_RSP1 must be the same when using this Average Latency feature.
See Section 18.5.2.3 for further details.
18.6.1 Performance Monitoring (Intel® Core™ Solo and Intel® Core™ Duo Processors)
In Intel Core Solo and Intel Core Duo processors, non-architectural performance monitoring events are
programmed using the same facilities (see Figure 18-1) used for architectural performance events.
Non-architectural performance events use event select values that are model-specific. Event mask (Umask) values
are also specific to event logic units. Some microarchitectural conditions detectable by a Umask value may have
specificity related to processor topology (see Section 8.6, “Detecting Hardware Multi-Threading Support and
Topology,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). As a result, the unit
mask field (for example, IA32_PERFEVTSELx[bits 15:8]) may contain sub-fields that specify topology information
of processor cores.
The sub-field layout within the Umask field may support a two-bit encoding that qualifies the relationship between
a microarchitectural condition and the originating core. This encoding is shown in Table 18-71. The two-bit encoding
for core-specificity is only supported for a subset of Umask values (see Chapter 19, “Performance Monitoring
Events”) and for Intel Core Duo processors. Such events are referred to as core-specific events.
Some microarchitectural conditions allow detection specificity only at the boundary of physical processors. Some
bus events belong to this category, providing specificity between the originating physical processor (a bus agent)
versus other agents on the bus. Sub-field encoding for agent specificity is shown in Table 18-72.
Some microarchitectural conditions are detectable only from the originating core. In such cases, unit mask does
not support core-specificity or agent-specificity encodings. These are referred to as core-only conditions.
Some microarchitectural conditions allow detection specificity that includes or excludes the action of hardware
prefetches. A two-bit encoding may be supported to qualify hardware prefetch actions. Typically, this applies only
to some L2 or bus events. The sub-field encoding for hardware prefetch qualification is shown in Table 18-73.
Some performance events may (a) support none of the three event-specific qualification encodings, (b) support
core-specificity and agent-specificity simultaneously, or (c) support core-specificity and hardware prefetch
qualification simultaneously. Agent-specificity and hardware prefetch qualification are mutually exclusive.
In addition, some L2 events permit qualifications that distinguish cache coherent states. The sub-field definition for
cache coherency state qualification is shown in Table 18-74. If no bits in the MESI qualification sub-field are set for
an event that requires setting MESI qualification bits, the event count will not increment.
There are also non-architectural events that support qualification of different types of snoop operation. The corre-
sponding bit fields for snoop type qualification are listed in Table 18-76.
No more than one of the MESI, snoop response, and snoop type qualification sub-fields can be supported in a
performance event.
NOTE
Software must write known values to the performance counters prior to enabling the counters. The
content of general-purpose counters and fixed-function counters are undefined after INIT or RESET.
[Figure: fixed-function counter control MSR layout — one 4-bit control field (including EN and PMI bits) per fixed-function counter in bits 11:0; bits 63:12 reserved]
• PMI field (fourth bit in each 4-bit control) — When set, the logical processor generates an exception
through its local APIC on an overflow condition of the respective fixed-function counter.
The MSR_PERF_GLOBAL_CTRL MSR provides single-bit controls to enable counting in each performance counter
(see Figure 18-42). Each enable bit in MSR_PERF_GLOBAL_CTRL is AND’ed with the enable bits for all privilege
levels in the respective IA32_PERFEVTSELx or IA32_FIXED_CTR_CTRL MSRs to start/stop the counting of the
respective counters. Counting is enabled if the AND’ed result is true; counting is disabled when the result is false.
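The AND'ed-enable semantics for a general-purpose counter can be sketched as follows; the EN bit of IA32_PERFEVTSELx is assumed at its architectural position, bit 22.

```python
EVTSEL_EN = 1 << 22  # EN bit of IA32_PERFEVTSELx (assumed architectural position)

def pmc_counting_enabled(global_ctrl, perfevtsel, pmc_index):
    """A general-purpose counter counts only when its enable bit in
    MSR_PERF_GLOBAL_CTRL AND the enable in its IA32_PERFEVTSELx are both set."""
    return bool(global_ctrl & (1 << pmc_index)) and bool(perfevtsel & EVTSEL_EN)
```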
[Figure 18-42: MSR_PERF_GLOBAL_CTRL — PMC0 enable (bit 0), PMC1 enable (bit 1), FIXED_CTR0/1/2 enables (bits 32, 33, 34); all other bits reserved]
The MSR_PERF_GLOBAL_STATUS MSR provides single-bit status flags used by software to query the overflow
condition of each performance counter. MSR_PERF_GLOBAL_STATUS[bit 62] indicates overflow conditions of the
DS area data buffer. MSR_PERF_GLOBAL_STATUS[bit 63] provides a CondChgd bit to indicate changes to the state
of performance monitoring hardware (see Figure 18-43). A value of 1 in bits 34:32, 1, or 0 indicates that an
overflow condition has occurred in the associated counter.
[Figure 18-43: MSR_PERF_GLOBAL_STATUS — PMC0 Overflow (bit 0), PMC1 Overflow (bit 1), FIXED_CTR0/1/2 Overflow (bits 32, 33, 34), OvfBuffer (bit 62), CondChgd (bit 63); all other bits reserved]
When a performance counter is configured for PEBS, an overflow condition in the counter will arm PEBS. On the
subsequent event following overflow, the processor will generate a PEBS event. On a PEBS event, the processor will
perform bounds checks based on the parameters defined in the DS Save Area (see Section 17.4.9). Upon
successful bounds checks, the processor will store the data record in the defined buffer area, clear the counter
overflow status, and reload the counter. If the bounds checks fail, the PEBS event will be skipped entirely. In the
event that the PEBS buffer fills up, the processor will set the OvfBuffer bit in MSR_PERF_GLOBAL_STATUS.
The MSR_PERF_GLOBAL_OVF_CTL MSR allows software to clear the overflow indicators for general-purpose or
fixed-function counters via a single WRMSR (see Figure 18-44). Clear overflow indications when:
• Setting up new values in the event select and/or UMASK field for counting or interrupt-based event sampling.
• Reloading counter values to continue collecting next sample.
• Disabling event counting or interrupt-based event sampling.
[Figure 18-44: MSR_PERF_GLOBAL_OVF_CTL — PMC0 ClrOverflow (bit 0), PMC1 ClrOverflow (bit 1), FIXED_CTR0/1/2 ClrOverflow (bits 32, 33, 34), ClrOvfBuffer (bit 62), ClrCondChgd (bit 63); all other bits reserved]
architecture does not require special MSR programming control (see Section 18.6.3.6, “At-Retirement Counting”),
but is limited to IA32_PMC0. See Table 18-77 for a list of events available to processors based on Intel Core micro-
architecture.
• The MSR_PEBS_ENABLE MSR, which enables the PEBS facilities and replay tagging used in at-retirement event
counting.
• A set of predefined events and event metrics that simplify the setting up of the performance counters to count
specific events.
Table 18-80 lists the performance counters and their associated CCCRs, along with the ESCRs that select events to
be counted for each performance counter. Predefined event metrics and events are listed in Chapter 19, “Perfor-
mance Monitoring Events.”
The types of events that can be counted with these performance monitoring facilities are divided into two classes:
non-retirement events and at-retirement events.
• Non-retirement events (see Table 19-33) are events that occur any time during instruction execution (such as
bus transactions or cache transactions).
• At-retirement events (see Table 19-34) are events that are counted at the retirement stage of instruction
execution, which allows finer granularity in counting events and capturing machine state.
The at-retirement counting mechanism includes facilities for tagging μops that have encountered a particular
performance event during instruction execution. Tagging allows events to be sorted between those that
occurred on an execution path that resulted in architectural state being committed at retirement as well as
events that occurred on an execution path where the results were eventually cancelled and never committed to
architectural state (such as, the execution of a mispredicted branch).
The Pentium 4 and Intel Xeon processor performance monitoring facilities support the three usage models
described below. The first two models can be used to count both non-retirement and at-retirement events; the
third model is used to count a subset of at-retirement events:
• Event counting — A performance counter is configured to count one or more types of events. While the
counter is counting, software reads the counter at selected intervals to determine the number of events that
have been counted between the intervals.
• Interrupt-based event sampling — A performance counter is configured to count one or more types of
events and to generate an interrupt when it overflows. To trigger an overflow, the counter is preset to a
modulus value that will cause the counter to overflow after a specific number of events have been counted.
When the counter overflows, the processor generates a performance monitoring interrupt (PMI). The interrupt
service routine for the PMI then records the return instruction pointer (RIP), resets the modulus, and restarts
the counter. Code performance can be analyzed by examining the distribution of RIPs with a tool like the
VTune™ Performance Analyzer.
• Processor event-based sampling (PEBS) — In PEBS, the processor writes a record of the architectural
state of the processor to a memory buffer after the counter overflows. The records of architectural state
provide additional information for use in performance tuning. Processor event-based sampling can be used to
count only a subset of at-retirement events. PEBS captures more precise processor state information compared
to interrupt-based event sampling, because the latter needs to use the interrupt service routine to reconstruct
the architectural state of the processor.
The following sections describe the MSRs and data structures used for performance monitoring in the Pentium 4
and Intel Xeon processors.
• USR flag, bit 2 — When set, events are counted when the processor is operating at a current privilege level
(CPL) of 1, 2, or 3. These privilege levels are generally used by application code and unprotected operating
system code.
• OS flag, bit 3 — When set, events are counted when the processor is operating at CPL of 0. This privilege level
is generally reserved for protected operating system code. (When both the OS and USR flags are set, events
are counted at all privilege levels.)
[Figure: ESCR layout — USR (bit 2), OS (bit 3), Tag Enable (bit 4), Tag Value (bits 8:5), Event Mask (bits 24:9), Event Select (bits 30:25); bits 63:31 reserved]
• Tag enable, bit 4 — When set, enables tagging of μops to assist in at-retirement event counting; when clear,
disables tagging. See Section 18.6.3.6, “At-Retirement Counting.”
• Tag value field, bits 5 through 8 — Selects a tag value to associate with a μop to assist in at-retirement
event counting.
• Event mask field, bits 9 through 24 — Selects events to be counted from the event class selected with the
event select field.
• Event select field, bits 25 through 30 — Selects a class of events to be counted. The events within this
class that are counted are selected with the event mask field.
When setting up an ESCR, the event select field is used to select a specific class of events to count, such as retired
branches. The event mask field is then used to select one or more of the specific events within the class to be
counted. For example, when counting retired branches, four different events can be counted: branch not taken
predicted, branch not taken mispredicted, branch taken predicted, and branch taken mispredicted. The OS and
USR flags allow counts to be enabled for events that occur when operating system code and/or application code are
being executed. If neither the OS nor USR flag is set, no events will be counted.
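The ESCR field positions described above can be sketched as a simple value builder; the bit positions come directly from the field descriptions, and the builder is illustrative rather than a definitive programming interface.

```python
def build_escr(event_select, event_mask, tag_value=0, tag_enable=False,
               os=False, usr=False):
    """Assemble an ESCR value from the fields described above: USR (bit 2),
    OS (bit 3), Tag Enable (bit 4), Tag Value (bits 8:5), Event Mask
    (bits 24:9), Event Select (bits 30:25)."""
    assert 0 <= event_select < (1 << 6) and 0 <= event_mask < (1 << 16)
    assert 0 <= tag_value < (1 << 4)
    return (int(usr) << 2 | int(os) << 3 | int(tag_enable) << 4 |
            tag_value << 5 | event_mask << 9 | event_select << 25)
```

Note that if neither `os` nor `usr` is set, the resulting ESCR counts no events, matching the text above.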
The ESCRs are initialized to all 0s on reset. The flags and fields of an ESCR are configured by writing to the ESCR
using the WRMSR instruction. Table 18-80 gives the addresses of the ESCR MSRs.
Writing to an ESCR MSR does not enable counting with its associated performance counter; it only selects the event
or events to be counted. The CCCR for the selected performance counter must also be configured. Configuration of
the CCCR includes selecting the ESCR and enabling the counter.
[Figure: performance counter layout — 40-bit counter in bits 39:0; bits 63:40 reserved]
The RDPMC instruction is not serializing or ordered with other instructions. Thus, it does not necessarily wait until
all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may
begin execution before the RDPMC instruction operation is performed.
Only the operating system, executing at privilege level 0, can directly manipulate the performance counters, using
the RDMSR and WRMSR instructions. A secure operating system would clear the PCE flag during system initializa-
tion to disable direct user access to the performance-monitoring counters, but provide a user-accessible program-
ming interface that emulates the RDPMC instruction.
Some uses of the performance counters require the counters to be preset before counting begins (that is, before
the counter is enabled). This can be accomplished by writing to the counter using the WRMSR instruction. To set a
counter to a specified number of counts before overflow, enter a 2’s complement negative integer in the counter.
The counter will then count from the preset value up to -1 and overflow. Writing to a performance counter in a
Pentium 4 or Intel Xeon processor with the WRMSR instruction causes all 40 bits of the counter to be written.
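The 2's complement preset described above can be computed as follows for the 40-bit counters:

```python
def counter_preset(events_before_overflow, width=40):
    """Return the 2's complement value to write (via WRMSR) so that a
    `width`-bit counter overflows after counting the given number of events;
    the counter counts up from the preset value through -1, then overflows."""
    assert 0 < events_before_overflow < (1 << width)
    return (1 << width) - events_before_overflow
```

For example, a preset for 1000 events is (1 << 40) - 1000; after 1000 increments the counter wraps past the 40-bit boundary.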
[Figure: CCCR layout — Enable (bit 12), ESCR Select (bits 15:13), reserved bits 17:16 (must be set to 11B), Compare (bit 18), Complement (bit 19), Threshold (bits 23:20), Edge (bit 24), FORCE_OVF (bit 25), OVF_PMI (bit 26), Cascade (bit 30), OVF (bit 31); bits 63:32 and remaining bits reserved]
• FORCE_OVF flag, bit 25 — When set, forces a counter overflow on every counter increment; when clear,
overflow only occurs when the counter actually overflows.
• OVF_PMI flag, bit 26 — When set, causes a performance monitor interrupt (PMI) to be generated when a
counter overflow occurs; when clear, disables PMI generation. Note that the PMI is generated on the next
event count after the counter has overflowed.
• Cascade flag, bit 30 — When set, enables counting on one counter of a counter pair when its alternate
counter in the other counter pair in the same counter group overflows (see Section 18.6.3.2, “Performance
Counters,” for further details); when clear, disables cascading of counters.
• OVF flag, bit 31 — Indicates that the counter has overflowed when set. This flag is a sticky flag that must be
explicitly cleared by software.
The CCCRs are initialized to all 0s on reset.
The events that an enabled performance counter actually counts are selected and filtered by the following flags and
fields in the ESCR and CCCR registers and in the qualification order given:
1. The event select and event mask fields in the ESCR select a class of events to be counted and one or more event
types within the class, respectively.
2. The OS and USR flags in the ESCR select the privilege levels at which events will be counted.
3. The ESCR select field of the CCCR selects the ESCR. Since each counter has several ESCRs associated with it,
one ESCR must be chosen to select the classes of events that may be counted.
4. The compare and complement flags and the threshold field of the CCCR select an optional threshold to be used
in qualifying an event count.
5. The edge flag in the CCCR allows events to be counted only on rising-edge transitions.
The qualification order in the above list implies that the filtered output of one “stage” forms the input for the next.
For instance, events filtered using the privilege level flags can be further qualified by the compare and complement
flags and the threshold field, and an event that matches the threshold criteria can be further qualified by edge
detection.
The uses of the flags and fields in the CCCRs are discussed in greater detail in Section 18.6.3.5, “Programming the
Performance Counters for Non-Retirement Events.”
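The staged qualification above can be modeled as a short sketch. The function and parameter names are illustrative, not an Intel API; the logic follows the ordering described in this section: privilege filtering, then the optional threshold comparison, then optional rising-edge detection (which is active only when the compare flag is set).

```python
# Sketch of the ESCR/CCCR qualification pipeline for one clock of raw
# event counts. Illustrative names, not an Intel API.

def qualify(raw_count, cpl_is_os, os_flag, usr_flag,
            compare, complement, threshold, edge, prev_output=False):
    """Return (counter_increment, threshold_output) for one clock."""
    # Stage: privilege-level filtering (OS/USR flags in the ESCR).
    if cpl_is_os and not os_flag:
        raw_count = 0
    if not cpl_is_os and not usr_flag:
        raw_count = 0
    # Stage: optional threshold comparison (compare/complement/threshold
    # in the CCCR).
    if compare:
        # complement set: count when events <= threshold; clear: when >.
        out = (raw_count <= threshold) if complement else (raw_count > threshold)
        increment = 1 if out else 0
        if edge:  # edge detection is active only when compare is set
            increment = 1 if (out and not prev_output) else 0
    else:
        out = False
        increment = raw_count
    return increment, out

# Example: count cycles in which more than 2 events occurred at CPL 0.
inc, _ = qualify(3, cpl_is_os=True, os_flag=True, usr_flag=False,
                 compare=True, complement=False, threshold=2, edge=False)
assert inc == 1
```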
For Table 19-33 and Table 19-34 in Chapter 19, the name of the event is listed in the Event Name column and
parameters that define the event and other information are listed in the Event Parameters column. The Parameter
Value and Description columns give specific parameters for the event and additional description information.
Entries in the Event Parameters column are described below.
• ESCR restrictions — Lists the ESCRs that can be used to program the event. Typically only one ESCR is
needed to count an event.
• Counter numbers per ESCR — Lists which performance counters are associated with each ESCR. Table 18-80
gives the name of the counter and CCCR for each counter number. Typically only one counter is needed to count
the event.
• ESCR event select — Gives the value to be placed in the event select field of the ESCR to select the event.
• ESCR event mask — Gives the value to be placed in the Event Mask field of the ESCR to select sub-events to
be counted. The parameter value column defines the documented bits with relative bit position offset starting
from 0, where the absolute bit position of relative offset 0 is bit 9 of the ESCR. All undocumented bits are
reserved and should be set to 0.
• CCCR select — Gives the value to be placed in the ESCR select field of the CCCR associated with the counter
to select the ESCR to be used to define the event. This value is not the address of the ESCR; it is the number of
the ESCR from the Number column in Table 18-80.
• Event specific notes — Gives additional information about the event, such as the name of the same or a
similar event defined for the P6 family processors.
• Can support PEBS — Indicates if PEBS is supported for the event (only supplied for at-retirement events
listed in Table 19-34.)
• Requires additional MSR for tagging — Indicates which if any additional MSRs must be programmed to
count the events (only supplied for the at-retirement events listed in Table 19-34.)
NOTE
The performance-monitoring events listed in Chapter 19, “Performance Monitoring Events,” are
intended to be used as guides for performance tuning. The counter values reported are not
guaranteed to be absolutely accurate and should be used as a relative guide for tuning. Known
discrepancies are documented where applicable.
The following procedure shows how to set up a performance counter for basic counting; that is, the counter is set
up to count a specified event indefinitely, wrapping around whenever it reaches its maximum count. This procedure
is continued through the following four sections.
Using information in Table 19-33, Chapter 19, an event to be counted can be selected as follows:
1. Select the event to be counted.
2. Select the ESCR to be used to select events to be counted from the ESCRs field.
3. Select the number of the counter to be used to count the event from the Counter Numbers Per ESCR field.
4. Determine the name of the counter and the CCCR associated with the counter, and determine the MSR
addresses of the counter, CCCR, and ESCR from Table 18-80.
5. Use the WRMSR instruction to write the ESCR Event Select and ESCR Event Mask values into the appropriate
fields in the ESCR. At the same time set or clear the USR and OS flags in the ESCR as desired.
6. Use the WRMSR instruction to write the CCCR Select value into the appropriate field in the CCCR.
NOTE
Typically all the fields and flags of the CCCR will be written with one WRMSR instruction; however,
in this procedure, several WRMSR writes are used to more clearly demonstrate the uses of the
various CCCR fields and flags.
This setup procedure is continued in the next section, Section 18.6.3.5.2, “Filtering Events.”
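Steps 5 and 6 above amount to packing bit fields into two MSR values before the WRMSR writes. The sketch below uses the ESCR layout (USR flag bit 2, OS flag bit 3, event mask bits 24:9, event select bits 30:25) and the CCCR layout (enable bit 12, ESCR select bits 15:13, reserved bits 17:16 set to 11B) described in this section; the example field values are placeholders, not a real event from the tables.

```python
# Sketch: pack the ESCR and CCCR values that would be written with WRMSR
# in steps 5 and 6. Helper names and example values are illustrative.

def make_escr(event_select, event_mask, os_flag=False, usr_flag=False):
    assert event_select < (1 << 6) and event_mask < (1 << 16)
    return ((event_select << 25) |      # event select, bits 30:25
            (event_mask << 9) |         # event mask, bits 24:9
            (int(os_flag) << 3) |       # OS flag, bit 3
            (int(usr_flag) << 2))       # USR flag, bit 2

def make_cccr(escr_select, enable=False):
    assert escr_select < (1 << 3)
    value = (escr_select << 13) | (0b11 << 16)  # bits 17:16 must be 11B
    if enable:
        value |= 1 << 12                        # enable flag, bit 12
    return value

# Placeholder event: select 0x3, mask 0x1, counted at all privilege levels.
escr = make_escr(event_select=0x3, event_mask=0x1, os_flag=True, usr_flag=True)
cccr = make_cccr(escr_select=0x1)               # counting not yet enabled
assert escr == (0x3 << 25) | (0x1 << 9) | 0b1100
assert cccr == (0x1 << 13) | (0b11 << 16)
```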
The following procedure shows how to configure a CCCR to filter events using the threshold filter and the edge
filter. This procedure is a continuation of the setup procedure introduced in Section 18.6.3.5.1, “Selecting Events
to Count.”
7. (Optional) To set up the counter for threshold filtering, use the WRMSR instruction to write values in the CCCR
compare and complement flags and the threshold field:
— Set the compare flag.
— Set or clear the complement flag for less than or equal to or greater than comparisons, respectively.
— Enter a value from 0 to 15 in the threshold field.
8. (Optional) Select rising edge filtering by setting the CCCR edge flag.
This setup procedure is continued in the next section, Section 18.6.3.5.3, “Starting Event Counting.”
[Figure: Effects of edge filtering. The counter increments on rising-edge (false-to-true) transitions of the threshold filter output, sampled on the processor clock.]
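Optional steps 7 and 8 above can be sketched as ORing the filter bits into a CCCR value before it is written back with WRMSR. Bit positions follow the CCCR layout in this section; the helper name is illustrative, not an Intel API.

```python
# Sketch: apply the threshold filter (steps 7) and edge filter (step 8)
# to an existing CCCR value. Illustrative helper, not an Intel API.

def add_threshold_filter(cccr, threshold, complement=False, edge=False):
    assert 0 <= threshold <= 15
    cccr |= 1 << 18                 # compare flag, bit 18
    if complement:
        cccr |= 1 << 19             # complement flag: <= comparison
    cccr |= threshold << 20         # threshold field, bits 23:20
    if edge:
        cccr |= 1 << 24             # edge flag: rising-edge detection
    return cccr

# Example: count cycles where more than 1 event occurs, edge-filtered.
cccr = add_threshold_filter(0b11 << 16, threshold=1, edge=True)
assert cccr & (1 << 18) and cccr & (1 << 24)
assert (cccr >> 20) & 0xF == 1
```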
around, it sets its OVF flag to indicate that the counter has overflowed. The OVF flag is a sticky flag that indicates
that the counter has overflowed at least once since the OVF bit was last cleared.
To halt counting, the CCCR enable flag for the counter must be cleared.
The following procedural step shows how to stop event counting. This step is a continuation of the setup procedure
introduced in Section 18.6.3.5.4, “Reading a Performance Counter’s Count.”
11. To stop event counting, execute a WRMSR instruction to clear the CCCR enable flag for the performance
counter.
To halt a cascaded counter (a counter that was started when its alternate counter overflowed), either clear the
Cascade flag in the cascaded counter’s CCCR MSR or clear the OVF flag in the alternate counter’s CCCR MSR.
The extended cascading feature can be adapted to the interrupt-based sampling usage model for performance
monitoring. However, due to an erratum, performance counters do not generate a PMI in cascade mode or
extended cascade mode. This erratum applies to processors with a CPUID DisplayFamily_DisplayModel
signature of 0F_02. For processors with CPUID DisplayFamily_DisplayModel signatures of 0F_00 and 0F_01, the
erratum applies to processors with a stepping encoding greater than 09H.
Counters 16 and 17 in the IQ block are frequently used in processor event-based sampling or at-retirement
counting of events indicating a stalled condition in the pipeline. Neither counter 16 nor counter 17 can initiate the
cascading of counter pairs using the cascade bit in a CCCR.
Extended cascading permits performance monitoring tools to use counters 16 and 17 to initiate cascading of two
counters in the IQ block. Extended cascading from counters 16 and 17 is conceptually similar to cascading other
counters, but instead of using the CASCADE bit of a CCCR, one of the four CASCNTxINTOy bits is used.
when overflow occurred. This information can then be used with a tool like the Intel® VTune™ Performance
Analyzer to analyze and tune program performance.
To enable an interrupt on counter overflow, the OVF_PMI flag in the counter’s associated CCCR MSR must be set.
When overflow occurs, a PMI is generated through the local APIC. (Here, the performance counter entry in the local
vector table [LVT] is set up to deliver the interrupt generated by the PMI to the processor.)
The PMI service routine can use the OVF flag to determine which counter overflowed when multiple counters have
been configured to generate PMIs. Also, note that these processors mask PMIs upon receiving an interrupt. Clear
this condition before leaving the interrupt handler.
When generating interrupts on overflow, the performance counter being used should be preset to a value that will
cause an overflow one count before the desired number of events has been counted. The simplest way to select the
preset value is to write a negative number into the counter, as described in Section 18.6.3.5.6, “Cascading Counters.”
Here, however, if an interrupt is to be generated after 100 event counts, the counter should be preset to minus 100
plus 1 (-100 + 1), or -99. The counter will then overflow after it counts 99 events and generate an interrupt on the
next (100th) event counted. The difference of 1 for this count enables the interrupt to be generated immediately
after the selected event count has been reached, instead of waiting for the overflow to be propagated through the
counter.
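The "minus the count, plus 1" rule above can be checked with a small sketch. The helper name is illustrative, not an Intel API; it reuses the 40-bit two's-complement representation described earlier for the Pentium 4 counters.

```python
# Sketch: preset value for a PMI after `events` counts. The PMI is
# generated on the next event count after the overflow, so the preset is
# -(events - 1). Illustrative helper, not an Intel API.

MASK40 = (1 << 40) - 1

def pmi_preset(events: int) -> int:
    return (-(events - 1)) & MASK40

# Interrupt after 100 events: preset is -99 in 40-bit two's complement.
assert pmi_preset(100) == (1 << 40) - 99
# The counter overflows (wraps to 0) on the 99th count; the PMI then
# arrives on the 100th event.
assert (pmi_preset(100) + 99) & MASK40 == 0
```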
Because of latency in the microarchitecture between the generation of events and the generation of interrupts on
overflow, it is sometimes difficult to generate an interrupt close to an event that caused it. In these situations, the
FORCE_OVF flag in the CCCR can be used to improve reporting. Setting this flag causes the counter to overflow on
every counter increment, which in turn triggers an interrupt after every counter increment.
event. For example, a μop may encounter a cache miss more than once during its lifetime, but a “Miss Retired”
metric (that counts the number of retired μops that encountered a cache miss) will increment only once for that
μop. A “Miss Retired” metric would be useful for characterizing the performance of the cache hierarchy for a
particular instruction sequence. Details of various performance metrics and how they can be constructed
using the Pentium 4 and Intel Xeon processors’ performance events are provided in the Intel Pentium 4
Processor Optimization Reference Manual (see Section 1.4, “Related Literature”).
• Replay — To maximize performance for the common case, the Intel NetBurst microarchitecture aggressively
schedules μops for execution before all the conditions for correct execution are guaranteed to be satisfied. In
the event that all of these conditions are not satisfied, μops must be reissued. The mechanism that the Pentium
4 and Intel Xeon processors use for this reissuing of μops is called replay. Some examples of replay causes are
cache misses, dependence violations, and unforeseen resource constraints. In normal operation, some number
of replays is common and unavoidable. An excessive number of replays is an indication of a performance
problem.
• Assist — When the hardware needs the assistance of microcode to deal with some event, the machine takes
an assist. One example of this is an underflow condition in the input operands of a floating-point operation. The
hardware must internally modify the format of the operands in order to perform the computation. Assists clear
the entire machine of μops before they begin and are costly.
Table 19-34 describes the Front_end_event and Table 19-36 describes metrics that are used to set up a
Front_end_event count.
The MSRs specified in Table 19-34 that are supported by the front-end tagging mechanism must be set, and one
or both of the NBOGUS and BOGUS bits in the Front_end_event event mask must be set, to count events. None of
the events currently supported requires the use of the MSR_TC_PRECISE_EVENT MSR.
generated, allowing the precise event records to be saved. A circular buffer is not supported for precise event
records.
PEBS is supported only for a subset of the at-retirement events: Execution_event, Front_end_event, and
Replay_event. Also, PEBS can be carried out using only one performance counter, the MSR_IQ_COUNTER4
MSR.
In processors based on Intel Core microarchitecture, a similar PEBS mechanism is also supported using
IA32_PMC0 and IA32_PERFEVTSEL0 MSRs (See Section 18.6.2.4).
Depending upon intended usage, the instruction pointers that are part of the branch records or the PEBS records
need to have an association with the corresponding process. One solution requires the ability for the DS-specific
operating system module to be chained to the context switch. A separate buffer can then be maintained for each
process of interest, and the MSR pointing to the configuration area saved and set up appropriately on each context
switch.
If the BTS facility has been enabled, then it must be disabled and state stored on transition of the system to a sleep
state in which processor context is lost. The state must be restored on return from the sleep state.
It is required that an interrupt gate be used for the DS interrupt as opposed to a trap gate to prevent the generation
of an endless interrupt loop.
Pages that contain buffers must have mappings to the same physical address for all processes/logical processors,
such that any change to CR3 will not change DS addresses. If this requirement cannot be satisfied (that is, the
feature is enabled on a per thread/process basis), then the operating system must ensure that the feature is
enabled/disabled appropriately in the context switch code.
[Figure 18-49. Event Selection Control Register (ESCR) for the Pentium 4 Processor, Intel Xeon Processor and Intel Xeon Processor MP Supporting Hyper-Threading Technology. Fields: T1_USR (bit 0); T1_OS (bit 1); T0_USR (bit 2); T0_OS (bit 3); Tag Enable (bit 4); Tag Value (bits 8:5); Event Mask (bits 24:9); Event Select (bits 30:25); bits 31 and 63:32 reserved.]
• T1_OS flag, bit 1 — When set, events are counted when thread 1 (logical processor 1) is executing at CPL of
0. This privilege level is generally reserved for protected operating system code. (When both the T1_OS and
T1_USR flags are set, thread 1 events are counted at all privilege levels.)
• T0_USR flag, bit 2 — When set, events are counted when thread 0 (logical processor 0) is executing at a CPL
of 1, 2, or 3.
• T0_OS flag, bit 3 — When set, events are counted when thread 0 (logical processor 0) is executing at CPL of
0. (When both the T0_OS and T0_USR flags are set, thread 0 events are counted at all privilege levels.)
• Tag enable, bit 4 — When set, enables tagging of μops to assist in at-retirement event counting; when clear,
disables tagging. See Section 18.6.3.6, “At-Retirement Counting.”
• Tag value field, bits 5 through 8 — Selects a tag value to associate with a μop to assist in at-retirement
event counting.
• Event mask field, bits 9 through 24 — Selects events to be counted from the event class selected with the
event select field.
• Event select field, bits 25 through 30) — Selects a class of events to be counted. The events within this
class that are counted are selected with the event mask field.
The T0_OS and T0_USR flags and the T1_OS and T1_USR flags allow event counting and sampling to be specified
for a specific logical processor (0 or 1) within an Intel Xeon processor MP (See also: Section 8.4.5, “Identifying
Logical Processors in an MP System,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3A).
Not all performance monitoring events can be detected within an Intel Xeon processor MP on a per logical processor
basis (see Section 18.6.4.4, “Performance Monitoring Events”). Some sub-events (specified by an event mask bits)
are counted or sampled without regard to which logical processor is associated with the detected event.
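The per-logical-processor ESCR encoding described above can be sketched as a packing helper. Bit positions follow the field list in this section (T1_USR bit 0, T1_OS bit 1, T0_USR bit 2, T0_OS bit 3, tag enable bit 4, tag value bits 8:5, event mask bits 24:9, event select bits 30:25); the example field values are placeholders, not a real event.

```python
# Sketch: pack an ESCR value for processors supporting Hyper-Threading
# Technology. Illustrative helper, not an Intel API.

def make_ht_escr(event_select, event_mask, t0_os=False, t0_usr=False,
                 t1_os=False, t1_usr=False, tag_enable=False, tag_value=0):
    assert event_select < (1 << 6) and event_mask < (1 << 16)
    assert tag_value < 16
    return ((event_select << 25) |        # event select, bits 30:25
            (event_mask << 9) |           # event mask, bits 24:9
            (tag_value << 5) |            # tag value, bits 8:5
            (int(tag_enable) << 4) |      # tag enable, bit 4
            (int(t0_os) << 3) |           # T0_OS, bit 3
            (int(t0_usr) << 2) |          # T0_USR, bit 2
            (int(t1_os) << 1) |           # T1_OS, bit 1
            int(t1_usr))                  # T1_USR, bit 0

# Count a placeholder event for thread 0 only, at all privilege levels.
escr = make_ht_escr(0x4, 0x1, t0_os=True, t0_usr=True)
assert escr == (0x4 << 25) | (0x1 << 9) | 0b1100
```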
• Compare flag, bit 18 — When set, enables filtering of the event count; when clear, disables filtering. The
filtering method is selected with the threshold, complement, and edge flags.
[Figure: Counter Configuration Control Register (CCCR) for processors supporting Hyper-Threading Technology. Fields: Enable (bit 12); ESCR select (bits 15:13); Active Thread (bits 17:16); Compare (bit 18); Complement (bit 19); Threshold (bits 23:20); Edge (bit 24); FORCE_OVF (bit 25); OVF_PMI_T0 (bit 26); OVF_PMI_T1 (bit 27); Cascade (bit 30); OVF (bit 31); bits 11:0, 29:28, and 63:32 reserved.]
• Complement flag, bit 19 — Selects how the incoming event count is compared with the threshold value.
When set, event counts that are less than or equal to the threshold value result in a single count being delivered
to the performance counter; when clear, counts greater than the threshold value result in a count being
delivered to the performance counter (see Section 18.6.3.5.2, “Filtering Events”). The complement flag is not
active unless the compare flag is set.
• Threshold field, bits 20 through 23 — Selects the threshold value to be used for comparisons. The
processor examines this field only when the compare flag is set, and uses the complement flag setting to
determine the type of threshold comparison to be made. The useful range of values that can be entered in this
field depend on the type of event being counted (see Section 18.6.3.5.2, “Filtering Events”).
• Edge flag, bit 24 — When set, enables rising edge (false-to-true) edge detection of the threshold comparison
output for filtering event counts; when clear, rising edge detection is disabled. This flag is active only when the
compare flag is set.
• FORCE_OVF flag, bit 25 — When set, forces a counter overflow on every counter increment; when clear,
overflow only occurs when the counter actually overflows.
• OVF_PMI_T0 flag, bit 26 — When set, causes a performance monitor interrupt (PMI) to be sent to logical
processor 0 when the counter overflows; when clear, disables PMI generation for logical processor 0.
Note that the PMI is generated on the next event count after the counter has overflowed.
• OVF_PMI_T1 flag, bit 27 — When set, causes a performance monitor interrupt (PMI) to be sent to logical
processor 1 when the counter overflows; when clear, disables PMI generation for logical processor 1.
Note that the PMI is generated on the next event count after the counter has overflowed.
• Cascade flag, bit 30 — When set, enables counting on one counter of a counter pair when its alternate
counter (the other counter of the pair in the same counter group) overflows (see Section 18.6.3.2, “Performance
Counters,” for further details); when clear, disables cascading of counters.
• OVF flag, bit 31 — Indicates that the counter has overflowed when set. This flag is a sticky flag that must be
explicitly cleared by software.
processor ID (T0 or T1); instead, they allow a software agent to enable PEBS for subsequent threads of execution
on the same logical processor on which the agent is running (“my thread”) or for the other logical processor in the
physical package on which the agent is not running (“other thread”).
PEBS is supported for only a subset of the at-retirement events: Execution_event, Front_end_event, and
Replay_event. Also, PEBS can be carried out only with two performance counters: MSR_IQ_CCCR4 (MSR address
370H) for logical processor 0 and MSR_IQ_CCCR5 (MSR address 371H) for logical processor 1.
Performance monitoring tools should use a processor affinity mask to bind the kernel mode components that need
to modify the ENABLE_PEBS_MY_THR and ENABLE_PEBS_OTH_THR bits in the MSR_PEBS_ENABLE MSR to a
specific logical processor. This is to prevent these kernel mode components from migrating between different
logical processors due to OS scheduling.
For events that are marked as TI in Chapter 19, the effect of selectively specifying the T0_USR, T0_OS,
T1_USR, and T1_OS bits (bits 0-3 of the associated ESCR) is shown in Table 18-84.
18.6.4.5 Counting Clocks on Systems with Intel Hyper-Threading Technology in Processors Based on
Intel NetBurst® Microarchitecture
3. Turn threshold comparison on in the CCCR by setting the compare bit to “1”.
4. Set the threshold to “15” and the complement to “1” in the CCCR. Since no event can exceed this threshold, the
threshold condition is met every cycle and the counter counts every cycle. Note that this overrides any
qualification (e.g., by CPL) specified in the ESCR.
5. Enable counting in the CCCR for the counter by setting the enable bit.
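Steps 3 through 5 above can be sketched as a single CCCR value: since no event count can exceed 15, a threshold of 15 with the complement bit set makes the threshold condition hold every cycle. Bit positions follow the CCCR layout in this section; the ESCR select value and helper name are placeholders.

```python
# Sketch: CCCR value for counting clockticks (compare=1, complement=1,
# threshold=15, enable=1). Illustrative helper, not an Intel API.

def clocktick_cccr(escr_select):
    value = escr_select << 13     # ESCR select, bits 15:13
    value |= 1 << 18              # compare: threshold filtering on (step 3)
    value |= 1 << 19              # complement: count when events <= threshold
    value |= 15 << 20             # threshold = 15 (step 4)
    value |= 1 << 12              # enable counting (step 5)
    return value

cccr = clocktick_cccr(escr_select=0b100)
assert (cccr >> 20) & 0xF == 15
assert cccr & (1 << 18) and cccr & (1 << 19) and cccr & (1 << 12)
```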
In most cases, the counts produced by the non-halted and non-sleep metrics are equivalent if the physical package
supports one logical processor and is not placed in a power-saving state. Operating systems may execute an HLT
instruction and place a physical processor in a power-saving state.
On processors that support Intel Hyper-Threading Technology (Intel HT Technology), each physical package can
support two or more logical processors. Current implementation of Intel HT Technology provides two logical
processors for each physical processor. While both logical processors can execute two threads simultaneously, one
logical processor may halt to allow the other logical processor to execute without sharing execution resources
between two logical processors.
Non-halted Clockticks can be set up to count the number of processor clock cycles for each logical processor when-
ever the logical processor is not halted (the count may include some portion of the clock cycles for that logical
processor to complete a transition to a halted state). Physical processors that support Intel HT Technology enter
into a power-saving state if all logical processors halt.
The Non-sleep Clockticks mechanism uses a filtering mechanism in CCCRs. The mechanism will continue to
increment as long as one logical processor is not halted or in a power-saving state. Applications may cause a
processor to enter into a power-saving state by using an OS service that transfers control to an OS's idle loop. The
idle loop then may place the processor into a power-saving state after an implementation-dependent period if
there is no work for the processor.
[Figure 18-51. Block Diagram of 64-bit Intel Xeon Processor MP with 8-MByte L3: processor core (front end, execution, retirement, L1, L2), IOQ, iFSB, and system bus.]
Additional performance monitoring capabilities and facilities unique to 64-bit Intel Xeon processor MP with an L3
cache are described in this section. The facility for monitoring events consists of a set of dedicated model-specific
registers (MSRs), each dedicated to a specific event. Programming of these MSRs requires using RDMSR/WRMSR
instructions with 64-bit values.
The lower 32 bits of the MSRs at addresses 107CCH through 107D3H are treated as 32-bit performance counter
registers. These performance counters can be accessed using the RDPMC instruction with indices 18 through 25.
The EDX register returns zero when reading these 8 PMCs.
The performance monitoring capabilities consist of four events. These are:
• IBUSQ event — This event detects the occurrence of micro-architectural conditions related to the iBUSQ unit.
It provides two MSRs: MSR_IFSB_IBUSQ0 and MSR_IFSB_IBUSQ1. Configure sub-event qualification and
enable/disable functions using the high 32 bits of these MSRs. The low 32 bits act as a 32-bit event counter.
Counting starts after software writes a non-zero value to one or more of the upper 32 bits. See Figure 18-52.
[Figure 18-52. MSR_IFSB_IBUSQx register layout: upper 32 bits hold the Saturate, Fill_match, Eviction_match, L3_state_match, Snoop_match, Type_match, T1_match, and T0_match qualification bits; lower 32 bits hold the 32-bit event count.]
• ISNPQ event — This event detects the occurrence of microarchitectural conditions related to the iSNPQ unit.
It provides two MSRs: MSR_IFSB_ISNPQ0 and MSR_IFSB_ISNPQ1. Configure sub-event qualifications and
enable/disable functions using the high 32 bits of the MSRs. The low 32-bits act as a 32-bit event counter.
Counting starts after software writes a non-zero value to one or more of the upper 32-bits. See Figure 18-53.
[Figure 18-53. MSR_IFSB_ISNPQx register layout: upper 32 bits hold the Saturate, L3_state_match, Snoop_match, Type_match, Agent_match, T1_match, and T0_match qualification bits; lower 32 bits hold the 32-bit event count.]
• EFSB event — This event can detect the occurrence of micro-architectural conditions related to the iFSB unit
or system bus. It provides two MSRs: MSR_EFSB_DRDY0 and MSR_EFSB_DRDY1. Configure sub-event
qualifications and enable/disable functions using the high 32 bits of the 64-bit MSR. The low 32 bits act as a
32-bit event counter. Counting starts after software writes a non-zero value to one or more of the qualification
bits in the upper 32 bits of the MSR. See Figure 18-54.
[Figure 18-54. MSR_EFSB_DRDYx register layout: upper 32 bits hold the Saturate, Other, and Own qualification bits; lower 32 bits hold the 32-bit event count.]
• IBUSQ Latency event — This event accumulates weighted cycle counts for latency measurement of
transactions in the iBUSQ unit. The count is enabled by setting MSR_IFSB_CTRL6[bit 26] to 1; the count
freezes after software sets MSR_IFSB_CTRL6[bit 26] to 0. MSR_IFSB_CNTR7 acts as a 64-bit event counter
for this event. See Figure 18-55.
[Figure 18-55. MSR_IFSB_CNTR7 layout: Enable bit and 64-bit event counter.]
[Figure: Block diagram of the L3/caching bus controller sub-system: FSB; GBSQ, GSNPQ, GINTQ; L3; SDI; L2 caches.]
Almost all of the performance monitoring capabilities available to processor cores with the same CPUID signatures
(see Section 18.1 and Section 18.6.4) apply to the Intel Xeon processor 7100 series. The MSRs used by the
performance monitoring interface are shared between the two logical processors in the same processor core.
The performance monitoring capabilities available to processors with DisplayFamily_DisplayModel signature 06_17H
also apply to the Intel Xeon processor 7400 series. Each processor core provides its own set of MSRs for the
performance monitoring interface.
The IOQ_allocation and IOQ_active_entries events are not supported in Intel Xeon processor 7100 series and 7400
series. Additional performance monitoring capabilities applicable to the L3/caching bus controller sub-system are
described in this section.
[Figure: Block diagram of the L3/caching bus controller sub-system: FSB; GBSQ, GSNPQ, GINTQ; L3; SDI.]
[Figure: GBSQ event qualification register layout: Saturate, Cross_snoop, Fill_eviction, Core_module_select, L3_state, Snoop_match, Type_match, Data_flow, and Agent_select fields; lower 32 bits hold the event count.]
• Data_Flow (bits 37:36): Bit 36 specifies demand transactions, bit 37 specifies prefetch transactions.
• Type_Match (bits 43:38): Specifies transaction types. If all six bits are set, event count will include all
transaction types.
• Snoop_Match (bits 46:44): The three bits specify (in ascending bit position) clean snoop result, HIT snoop
result, and HITM snoop result, respectively.
• L3_State (bits 53:47): Each bit specifies an L2 coherency state.
• Core_Module_Select (bits 55:54): The valid encodings for L3 lookup differ slightly between Intel Xeon
processor 7100 and 7400.
For Intel Xeon processor 7100 series,
— 00B: Match transactions from any core in the physical package
— 01B: Match transactions from this core only
— 10B: Match transactions from the other core in the physical package
— 11B: Match transactions from both cores in the physical package
For Intel Xeon processor 7400 series,
— 00B: Match transactions from any dual-core module in the physical package
— 01B: Match transactions from this dual-core module only
— 10B: Match transactions from either one of the other two dual-core modules in the physical package
— 11B: Match transactions from more than one dual-core module in the physical package
• Fill_Eviction (bits 57:56): The valid encodings are
— 00B: Match any transactions
— 01B: Match transactions that fill L3
— 10B: Match transactions that fill L3 without an eviction
— 11B: Match transactions that fill L3 with an eviction
• Cross_Snoop (bit 58): The encodings are
— 0B: Match any transactions
— 1B: Match cross snoop transactions
For each counting clock domain, if all eight attributes match, event logic signals to increment the event count field.
[Figure: GSNPQ event qualification register layout: Saturate, Block_snoop, Core_select, L2_state, Snoop_match, Type_match, and Agent_match fields; lower 32 bits hold the event count.]
[Figure: FSB event qualification register layout: Saturate and FSB submask fields; lower 32 bits hold the event count.]
• FSB_L_hitm (bit 40): Count HITM snoop results from any source for transactions originated from this physical
package.
• FSB_L_defer (bit 41): Count DEFER responses to this processor’s transactions.
• FSB_L_retry (bit 42): Count RETRY responses to this processor’s transactions.
• FSB_L_snoop_stall (bit 43): Count snoop stalls to this processor’s transactions.
• FSB_DBSY (bit 44): Count DBSY assertions by this processor (without a concurrent DRDY).
• FSB_DRDY (bit 45): Count DRDY assertions by this processor.
• FSB_BNR (bit 46): Count BNR assertions by this processor.
• FSB_IOQ_empty (bit 47): Counts each bus clock when the IOQ is empty.
• FSB_IOQ_full (bit 48): Counts each bus clock when the IOQ is full.
• FSB_IOQ_active (bit 49): Counts each bus clock when there is at least one entry in the IOQ.
• FSB_WW_data (bit 50): Counts back-to-back write transaction’s data phase.
• FSB_WW_issue (bit 51): Counts back-to-back write transaction request pairs issued by this processor.
• FSB_WR_issue (bit 52): Counts back-to-back write-read transaction request pairs issued by this processor.
• FSB_RW_issue (bit 53): Counts back-to-back read-write transaction request pairs issued by this processor.
• FSB_other_DBSY (bit 54): Count DBSY assertions by another agent (without a concurrent DRDY).
• FSB_other_DRDY (bit 55): Count DRDY assertions by another agent.
• FSB_other_snoop_stall (bit 56): Count snoop stalls on the FSB due to another agent.
• FSB_other_BNR (bit 57): Count BNR assertions from another agent.
NOTE
The performance-monitoring events listed in Chapter 19 are intended to be used as guides for
performance tuning. Counter values reported are not guaranteed to be accurate and should be
used as a relative guide for tuning. Known discrepancies are documented where applicable.
The performance-monitoring counters are supported by four MSRs: the performance event select MSRs
(PerfEvtSel0 and PerfEvtSel1) and the performance counter MSRs (PerfCtr0 and PerfCtr1). These registers can be
read from and written to using the RDMSR and WRMSR instructions, respectively. They can be accessed using
these instructions only when operating at privilege level 0. The PerfCtr0 and PerfCtr1 MSRs can be read from any
privilege level using the RDPMC (read performance-monitoring counters) instruction.
NOTE
The PerfEvtSel0, PerfEvtSel1, PerfCtr0, and PerfCtr1 MSRs and the events listed in Table 19-42 are
model-specific for P6 family processors. They are not guaranteed to be available in other IA-32
processors.
[Figure: PerfEvtSel0 and PerfEvtSel1 MSR layout: Event Select (bits 7:0); Unit Mask, UMASK (bits 15:8); USR (bit 16); OS (bit 17); E (bit 18); PC (bit 19); INT (bit 20); EN (bit 22, PerfEvtSel0 only); INV (bit 23); Counter Mask, CMASK (bits 31:24).]
• USR (user mode) flag (bit 16) — Specifies that events are counted only when the processor is operating at
privilege levels 1, 2 or 3. This flag can be used in conjunction with the OS flag.
• OS (operating system mode) flag (bit 17) — Specifies that events are counted only when the processor is
operating at privilege level 0. This flag can be used in conjunction with the USR flag.
• E (edge detect) flag (bit 18) — Enables (when set) edge detection of events. The processor counts the
number of deasserted to asserted transitions of any condition that can be expressed by the other fields. The
mechanism is limited in that it does not permit back-to-back assertions to be distinguished. This mechanism
allows software to measure not only the fraction of time spent in a particular state, but also the average length
of time spent in such a state (for example, the time spent waiting for an interrupt to be serviced).
• PC (pin control) flag (bit 19) — When set, the processor toggles the PMi pins and increments the counter
when performance-monitoring events occur; when clear, the processor toggles the PMi pins when the counter
overflows. The toggling of a pin is defined as assertion of the pin for a single bus clock followed by deassertion.
• INT (APIC interrupt enable) flag (bit 20) — When set, the processor generates an exception through its
local APIC on counter overflow.
• EN (Enable Counters) Flag (bit 22) — This flag is only present in the PerfEvtSel0 MSR. When set,
performance counting is enabled in both performance-monitoring counters; when clear, both counters are
disabled.
• INV (invert) flag (bit 23) — When set, inverts the counter-mask (CMASK) comparison, so that both greater
than or equal to and less than comparisons can be made (0: greater than or equal; 1: less than). Note that if the
counter mask is programmed to zero, the INV flag is ignored.
• Counter mask (CMASK) field (bits 24 through 31) — When nonzero, the processor compares this mask to
the number of events counted during a single cycle. If the event count is greater than or equal to this mask, the
counter is incremented by one. Otherwise the counter is not incremented. This mask can be used to count
events only if multiple occurrences happen per clock (for example, two or more instructions retired per clock).
If the counter-mask field is 0, then the counter is incremented each cycle by the number of events that
occurred that cycle.
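As a sketch of how these fields combine, the following snippet (illustrative only; the helper name is ours, and event code 79H is the P6 CPU_CLK_UNHALTED event) packs the field values described above into a 32-bit PerfEvtSel image:

```python
def perfevtsel(event, umask=0, usr=False, os=False, edge=False,
               pc=False, intr=False, en=False, inv=False, cmask=0):
    """Pack the PerfEvtSel fields described above into a 32-bit value."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8)
    value |= (usr << 16) | (os << 17) | (edge << 18) | (pc << 19)
    value |= (intr << 20) | (en << 22) | (inv << 23)
    value |= (cmask & 0xFF) << 24
    return value

# Count unhalted cycles (P6 event 79H) at all privilege levels,
# with counting enabled via the EN flag in PerfEvtSel0.
sel0 = perfevtsel(0x79, usr=True, os=True, en=True)
assert sel0 == 0x430079
```

The resulting value would then be written to the PerfEvtSel0 MSR with WRMSR at privilege level 0.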
The event monitor feature determination procedure must check whether the current processor supports the
performance-monitoring counters and time-stamp counter. This procedure compares the family and model of the
processor returned by the CPUID instruction with those of processors known to support performance monitoring.
(The Pentium and P6 family processors support performance counters.) The procedure also checks the MSR and
TSC flags returned to register EDX by the CPUID instruction to determine if the MSRs and the RDTSC instruction
are supported.
The initialize and start counters procedure sets the PerfEvtSel0 and/or PerfEvtSel1 MSRs for the events to be
counted and the method used to count them and initializes the counter MSRs (PerfCtr0 and PerfCtr1) to starting
counts. The stop counters procedure stops the performance counters (see Section 18.6.8.3, “Starting and Stop-
ping the Performance-Monitoring Counters”).
The read counters procedure reads the values in the PerfCtr0 and PerfCtr1 MSRs, and a read time-stamp counter
procedure reads the time-stamp counter. These procedures would be provided in lieu of enabling the RDTSC and
RDPMC instructions that allow application code to read the counters.
NOTES
The CESR, CTR0, and CTR1 MSRs and the events listed in Table 19-43 are model-specific for the
Pentium processor.
The performance-monitoring events listed in Chapter 19 are intended to be used as guides for
performance tuning. Counter values reported are not guaranteed to be accurate and should be
used as a relative guide for tuning. Known discrepancies are documented where applicable.
[Figure: Layout of the CESR MSR — ES0 (event select 0, bits 0-5), CC0 (counter control 0, bits 6-8), PC0 (pin control 0, bit 9), ES1 (event select 1, bits 16-21), CC1 (counter control 1, bits 22-24), PC1 (pin control 1, bit 25); bits 10-15 and 26-31 reserved]
• CC0 and CC1 (counter control) fields (bits 6-8, bits 22-24) — Control the operation of the counter.
Control codes are as follows:
000 — Count nothing (counter disabled).
001 — Count the selected event while CPL is 0, 1, or 2.
010 — Count the selected event while CPL is 3.
011 — Count the selected event regardless of CPL.
100 — Count nothing (counter disabled).
101 — Count clocks (duration) while CPL is 0, 1, or 2.
110 — Count clocks (duration) while CPL is 3.
111 — Count clocks (duration) regardless of CPL.
The highest order bit selects between counting events and counting clocks (duration); the middle bit enables
counting when the CPL is 3; and the low-order bit enables counting when the CPL is 0, 1, or 2.
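The three CC bits can be decoded mechanically from that description; a small illustrative helper (the function name is ours, not from the manual):

```python
def decode_cc(cc):
    """Decode a 3-bit CESR counter-control code per the list above."""
    what = "clocks" if cc & 0b100 else "events"  # high bit: clocks vs. events
    cpl3 = bool(cc & 0b010)                      # middle bit: count at CPL 3
    cpl012 = bool(cc & 0b001)                    # low bit: count at CPL 0, 1, 2
    if not (cpl3 or cpl012):
        return "count nothing (counter disabled)"
    if cpl3 and cpl012:
        when = "regardless of CPL"
    elif cpl3:
        when = "while CPL is 3"
    else:
        when = "while CPL is 0, 1, or 2"
    return f"count {what} {when}"

assert decode_cc(0b000) == "count nothing (counter disabled)"
assert decode_cc(0b011) == "count events regardless of CPL"
assert decode_cc(0b101) == "count clocks while CPL is 0, 1, or 2"
```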
• PC0 and PC1 (pin control) flags (bits 9, 25) — Selects the function of the external performance-monitoring
counter pin (PM0/BP0 and PM1/BP1). Setting one of these flags to 1 causes the processor to assert its
associated pin when the counter has overflowed; setting the flag to 0 causes the pin to be asserted when the
counter has been incremented. These flags permit the pins to be individually programmed to indicate the
overflow or incremented condition. The external signaling of the event on the pins will lag the internal event by
a few clocks as the signals are latched and buffered.
While a counter need not be stopped to sample its contents, it must be stopped and cleared or preset before
switching to a new event. It is not possible to set one counter separately. If only one event needs to be changed,
the CESR register must be read, the appropriate bits modified, and all bits must then be written back to CESR. At
reset, all bits in the CESR register are cleared.
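The read-modify-write requirement can be sketched as a pure bit splice; the CESR image here is an ordinary integer standing in for the value read by RDMSR, and the field positions follow the layout above:

```python
ES0_MASK = 0x3F          # event select 0: bits 0-5
ES1_MASK = 0x3F << 16    # event select 1: bits 16-21

def set_es0(cesr, new_event):
    """Change only ES0: take the whole CESR image, splice in the new
    event select, and return the full value to be written back."""
    return (cesr & ~ES0_MASK) | (new_event & 0x3F)

# Changing counter 0's event leaves counter 1's configuration intact.
cesr = set_es0(0x003F_0000, 0x16)   # ES1 fully set, ES0 becomes 16H
assert cesr == 0x003F_0016
```

In a real driver the same pattern would be RDMSR on CESR, the splice above, then WRMSR of the complete value.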
entire duration of the event. When the performance-monitor pins are configured to indicate when the counter has
overflowed, the associated PM pin is asserted when the counter has overflowed.
When the PM0/BP0 and/or PM1/BP1 pins are configured to signal that a counter has incremented, it should be
noted that although the counters may increment by 1 or 2 in a single clock, the pins can only indicate that the event
occurred. Moreover, since the internal clock frequency may be higher than the external clock frequency, a single
external clock may correspond to multiple internal clocks.
A “count up to” function may be provided when the event pin is programmed to signal an overflow of the counter.
Because the counters are 40 bits, a carry out of bit 39 indicates an overflow. A counter may be preset to a specific
value less than 2^40 − 1. After the counter has been enabled and the prescribed number of events has transpired,
the counter will overflow.
Approximately 5 clocks later, the overflow is indicated externally and appropriate action, such as signaling an inter-
rupt, may then be taken.
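The preset arithmetic is straightforward; a minimal sketch of the "count up to N" computation (the helper name is ours):

```python
COUNTER_BITS = 40  # Pentium performance counters are 40 bits wide

def preset_for(n_events):
    """Preset value so the counter overflows after n_events increments."""
    assert 0 < n_events < 2**COUNTER_BITS
    return (2**COUNTER_BITS - n_events) & (2**COUNTER_BITS - 1)

# After 1000 further events, the carry out of bit 39 signals the overflow.
assert preset_for(1000) + 1000 == 2**40
```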
The PM0/BP0 and PM1/BP1 pins also serve to indicate breakpoint matches during in-circuit emulation, during which
time the counter increment or overflow function of these pins is not available. After RESET, the PM0/BP0 and
PM1/BP1 pins are configured for performance monitoring, however a hardware debugger may reconfigure these
pins to indicate breakpoint matches.
• Core Crystal Clock — This is a clock that runs at a fixed frequency; it coordinates the clocks on all packages
across the system.
• Non-halted Clockticks — Measures clock cycles in which the specified logical processor is not halted and is
not in any power-saving state. When Intel Hyper-Threading Technology is enabled, ticks can be measured on a
per-logical-processor basis. There are also performance events on dual-core processors that measure
clockticks per logical processor when the processor is not halted.
• Non-sleep Clockticks — Measures clock cycles in which the specified physical processor is not in a sleep mode
or in a power-saving state. These ticks cannot be measured on a logical-processor basis.
• Time-stamp Counter — See Section 17.17, “Time-Stamp Counter”.
• Reference Clockticks — TM2 and Enhanced Intel SpeedStep technology are two examples of processor
features that can cause processor core clockticks to represent non-uniform tick intervals due to changes of bus
ratios. Performance events that count clockticks at a constant reference frequency were introduced on Intel
Core Duo and Intel Core Solo processors. The mechanism is further enhanced on processors based on Intel Core
microarchitecture.
Some processor models permit clock cycles to be measured when the physical processor is not in deep sleep (by
using the time-stamp counter and the RDTSC instruction). Note that such ticks cannot be measured on a per-
logical-processor basis. See Section 17.17, “Time-Stamp Counter,” for detail on processor capabilities.
The first two methods use performance counters and can be set up to cause an interrupt upon overflow (for
sampling). They may also be useful where it is easier for a tool to read a performance counter than to use a time
stamp counter (the timestamp counter is accessed using the RDTSC instruction).
For applications with a significant amount of I/O, there are two ratios of interest:
• Non-halted CPI — Non-halted clockticks/instructions retired measures the CPI for phases where the CPU was
being used. This ratio can be measured on a logical-processor basis when Intel Hyper-Threading Technology is
enabled.
• Nominal CPI — Time-stamp counter ticks/instructions retired measures the CPI over the duration of a
program, including those periods when the machine halts while waiting for I/O.
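These two ratios can be sketched directly; the counts below are hypothetical, and a large gap between the ratios suggests the program spends much of its wall-clock time halted (e.g., waiting for I/O):

```python
def non_halted_cpi(non_halted_clockticks, instructions_retired):
    """CPI over phases where the (logical) CPU was actually in use."""
    return non_halted_clockticks / instructions_retired

def nominal_cpi(tsc_ticks, instructions_retired):
    """CPI over the whole run, including halted periods."""
    return tsc_ticks / instructions_retired

assert non_halted_cpi(2_000_000, 1_000_000) == 2.0
assert nominal_cpi(8_000_000, 1_000_000) == 8.0
```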
Intel® Xeon® Processor Scalable Family with CPUID signature 06_55H — 25 MHz
6th and 7th generation Intel® Core™ processors and Intel® Xeon® W Processor Family — 24 MHz
Next Generation Intel® Atom™ processors based on Goldmont Microarchitecture with CPUID signature 06_5CH
(does not include Intel Xeon processors) — 19.2 MHz
NOTES:
1. For any processor in which CPUID.15H is enumerated and MSR_PLATFORM_INFO[15:8] (which gives the scalable bus frequency) is
available, a more accurate frequency can be obtained by using CPUID.15H.
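The CPUID.15H computation can be sketched as follows (register meanings per the CPUID.15H leaf definition: EAX = ratio denominator, EBX = ratio numerator, ECX = core crystal clock frequency in Hz; the sample values are hypothetical):

```python
def tsc_hz_from_cpuid_15h(eax, ebx, ecx):
    """TSC frequency = core crystal clock * EBX/EAX, when enumerated."""
    if eax == 0 or ebx == 0 or ecx == 0:
        return None  # ratio or crystal frequency not enumerated
    return ecx * ebx // eax

# Hypothetical leaf values: 25 MHz crystal, 100/1 ratio -> 2.5 GHz TSC.
assert tsc_hz_from_cpuid_15h(1, 100, 25_000_000) == 2_500_000_000
```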
18.7.3.1 For Intel® Processors Based on Microarchitecture Code Name Sandy Bridge, Ivy Bridge,
Haswell and Broadwell
The scalable bus frequency is encoded in the bit field MSR_PLATFORM_INFO[15:8] and the nominal TSC frequency
can be determined by multiplying this number by a bus speed of 100 MHz.
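A minimal sketch of this computation (the MSR value would come from a privileged RDMSR; the sample value below is hypothetical):

```python
def nominal_tsc_hz(msr_platform_info):
    """Nominal TSC frequency on these microarchitectures:
    MSR_PLATFORM_INFO[15:8] (bus ratio) times the 100 MHz bus clock."""
    ratio = (msr_platform_info >> 8) & 0xFF
    return ratio * 100_000_000

# A ratio of 34 corresponds to a 3.4 GHz nominal TSC frequency.
assert nominal_tsc_hz(34 << 8) == 3_400_000_000
```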
18.7.3.3 For Intel® Atom™ Processors Based on the Silvermont Microarchitecture (Including Intel
Processors Based on Airmont Microarchitecture)
The ratio is encoded in the bit field MSR_PLATFORM_INFO[15:8], and the nominal TSC frequency can be
determined by multiplying this number by the scalable bus frequency. The scalable bus frequency is
encoded in the bit field MSR_FSB_FREQ[2:0] for Intel Atom processors based on the Silvermont microarchitecture,
and in bit field MSR_FSB_FREQ[3:0] for processors based on the Airmont microarchitecture; see Chapter 2,
“Model-Specific Registers (MSRs)” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume
4.
18.7.3.4 For Intel® Core™ 2 Processor Family and for Intel® Xeon® Processors Based on Intel Core
Microarchitecture
For processors based on Intel Core microarchitecture, the scalable bus frequency is encoded in the bit field
MSR_FSB_FREQ[2:0] at address 0CDH; see Chapter 2, “Model-Specific Registers (MSRs)” in the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 4. The maximum resolved bus ratio can be read from the
following bit field:
• If XE operation is disabled, the maximum resolved bus ratio can be read in MSR_PLATFORM_ID[12:8]. It
corresponds to the Processor Base frequency.
• If XE operation is enabled, the maximum resolved bus ratio is given in MSR_PERF_STATUS[44:40]; it
corresponds to the maximum XE operation frequency configured by BIOS.
XE operation of an Intel 64 processor is implementation specific. XE operation can be enabled only by BIOS. If
MSR_PERF_STATUS[31] is set, XE operation is enabled. The MSR_PERF_STATUS[31] field is read-only.
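The selection rule above can be sketched as a pure bit-field computation (the MSR values would come from privileged RDMSRs; the sample values are hypothetical):

```python
def max_resolved_bus_ratio(msr_platform_id, msr_perf_status):
    """Pick the bus-ratio field per the rules above: MSR_PERF_STATUS[44:40]
    when XE operation is enabled (MSR_PERF_STATUS[31] set), otherwise
    MSR_PLATFORM_ID[12:8]."""
    xe_enabled = bool(msr_perf_status & (1 << 31))
    if xe_enabled:
        return (msr_perf_status >> 40) & 0x1F
    return (msr_platform_id >> 8) & 0x1F

# XE disabled: ratio comes from MSR_PLATFORM_ID[12:8].
assert max_resolved_bus_ratio(0x0B << 8, 0) == 0x0B
# XE enabled: ratio comes from MSR_PERF_STATUS[44:40].
assert max_resolved_bus_ratio(0x0B << 8, (1 << 31) | (0x0D << 40)) == 0x0D
```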
[Figure: Layout of the IA32_PERF_CAPABILITIES MSR — LBR_FMT (bits 5:0, R/O), PEBS_TRAP (bit 6, R/O), PEBS_ARCH_REG (bit 7, R/O), PEBS_REC_FMT (bits 11:8, R/O), SMM_FREEZE (bit 12, R/O), FW_WRITE (bit 13, R/O), PERF_METRICS_AVAILABLE (bit 15, R/O), PEBS_OUTPUT_PT_AVAIL (bit 16, R/O); other bits reserved]
[Figure: Layout of the IA32_PEBS_ENABLE MSR — PEBS_EN_PMC0 through PEBS_EN_PMCm (bits 0 through m, R/W) enable PEBS on the general-purpose counters; PEBS_EN_FIXED0 through PEBS_EN_FIXEDn (bits 32 through 32+n, R/W) enable PEBS on the fixed-function counters; other bits reserved]
A PEBS record due to a precise event will be generated after an instruction that causes the event when the counter
has already overflowed. A PEBS record due to a non-precise event will occur at the next opportunity after the
counter has overflowed, including immediately after an overflow is set by an MSR write.
Currently, IA32_FIXED_CTR0 counts instructions retired and is a precise event. IA32_FIXED_CTR1,
IA32_FIXED_CTR2 … IA32_FIXED_CTRm count as non-precise events.
The Applicable Counter field in the Basic Info Group of the PEBS record indicates which counters caused the PEBS
record to be generated. It is in the same format as the enable bits for each counter in IA32_PEBS_ENABLE. As an
example, an Applicable Counter field with bits 2 and 32 set would indicate that both general purpose counter 2 and
fixed function counter 0 generated the PEBS record.
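Decoding that field can be sketched directly, since its layout matches the IA32_PEBS_ENABLE enable bits (the helper name is ours):

```python
def decode_applicable_counters(field):
    """Split the Applicable Counters bitmask into general-purpose and
    fixed-function counter indices (fixed counters start at bit 32,
    matching the IA32_PEBS_ENABLE layout)."""
    gp = [i for i in range(32) if field & (1 << i)]
    fixed = [i for i in range(32) if field & (1 << (32 + i))]
    return gp, fixed

# Bits 2 and 32 set: general-purpose counter 2 and fixed-function
# counter 0, as in the example above.
assert decode_applicable_counters((1 << 2) | (1 << 32)) == ([2], [0])
```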
• To properly use PEBS for the additional counters, software will need to set up the counter reset values in the
PEBS portion of the DS_BUFFER_MANAGEMENT_AREA data structure that is indicated by the IA32_DS_AREA
register. The layout of the DS_BUFFER_MANAGEMENT_AREA is shown in Figure 18-65. When a counter
generates a PEBS record, the appropriate counter reset value will be loaded into that counter. In the above
example where general purpose counter 2 and fixed function counter 0 generated the PEBS record, general
purpose counter 2 would be reloaded with the value contained in PEBS GP Counter 2 Reset (offset 50H) and
fixed function counter 0 would be reloaded with the value contained in PEBS Fixed Counter 0 Reset (offset
80H).
[Figure: PEBS records in the DS buffer management area referenced by the IA32_DS_AREA MSR]
Extended PEBS support debuts on Intel® Atom processors based on the Goldmont Plus microarchitecture and
future Intel® Core™ processors based on the Ice Lake microarchitecture.
Details and examples for the Adaptive PEBS capability follow below.
[Figure: Layout of the IA32_FIXED_CTR_CTRL MSR (address: 38DH; scope: thread; reset value: 0) — for each fixed-function counter FCx: EN_FCx (enable), AnyThr_FCx (any-thread), and PMI_FCx (interrupt enable) control fields in the low word, plus FC0_ADAPTIVE_RECORD through FC3_ADAPTIVE_RECORD bits (bits 32-35); other bits reserved]
Record Format [47:0] This field indicates which data groups are included in the record. The field is zero if
none of the counters that triggered the current PEBS record have their
Adaptive_Record bit set. Otherwise it contains the value of MSR_PEBS_DATA_CFG.
[63:48] This field provides the size of the current record in bytes. Selected groups are
packed back-to-back in the record without gaps or padding for unselected groups.
Instruction Pointer [63:0] This field reports the Eventing Instruction Pointer (EventingIP) of the retired
instruction that triggered the PEBS record generation. Note that this field is
different from R/EIP, which records the instruction pointer of the next instruction
to be executed after record generation. The legacy R/EIP field has been removed.
Applicable Counters [63:0] The Applicable Counters field indicates which counters triggered the generation of
the PEBS record, linking the record to specific events. This allows software to
correlate the PEBS record entry properly with the instruction that caused the
event, even when multiple counters are configured to generate PEBS records and
multiple bits are set in the field.
TSC [63:0] This field provides the time stamp counter value when the PEBS record was
generated.
Memory Access Address [63:0] This field contains the linear address of the source of the load, or linear address of
the destination (target) of the store. This value is written as a 64-bit address in
canonical form.
Memory Auxiliary Info [63:0] When a MEM_TRANS_RETIRED.* event is configured in a General Purpose counter,
this field contains an encoded value indicating the memory hierarchy source which
satisfied the load. These encodings are detailed in Table 18-4 and Table 18-13. If
the PEBS assist was triggered for a store uop, this field will contain information
indicating the status of the store, as detailed in Table 18-14.
Memory Access Latency [63:0] When a MEM_TRANS_RETIRED.* event is configured in a General Purpose counter,
this field contains the latency to service the load in core clock cycles.
TSX Auxiliary Info [31:0] This field contains the number of cycles in the last TSX region, regardless of
whether that region had aborted or committed.
[63:32] This field contains the abort details. Refer to Section 18.3.6.5.1.
18.9.2.2.3 GPRs
This group is captured when the GPR bit is enabled in MSR_PEBS_DATA_CFG. GPRs are always 64 bits wide. If they
are selected in non-64-bit mode, the upper 32 bits of the legacy RAX–RDI registers and the full contents of the
R8–R15 GPRs will be filled with 0s. In 64-bit mode, the full 64-bit value of each register is written.
The order differs from legacy. The table below shows the order of the GPRs in Ice Lake microarchitecture.
RFLAGS [63:0]
RIP [63:0]
RAX [63:0]
RCX* [63:0]
RDX* [63:0]
RBX* [63:0]
RSP* [63:0]
RBP* [63:0]
RSI* [63:0]
RDI* [63:0]
R8 [63:0]
... ...
R15 [63:0]
The machine state reported in the PEBS record is the committed machine state immediately after the instruction
that triggers PEBS completes.
For instance, consider the following instruction sequence:
MOV eax, [eax]; triggers PEBS record generation
NOP
If the MOV instruction triggers PEBS record generation, the EventingIP field in the PEBS record will report the
address of the MOV, the value of EAX in the PEBS record will show the value read from memory (not the target
address of the read operation), and the value of RIP will contain the linear address of the NOP.
18.9.2.2.4 XMMs
This group is captured when the XMM bit is enabled in MSR_PEBS_DATA_CFG and SSE is enabled. If SSE is not
enabled, the fields will contain zeroes. XMM8-XMM15 will also contain zeroes if not in 64-bit mode.
XMM0 [127:0]
... ...
XMM15 [127:0]
18.9.2.2.5 LBRs
To capture LBR data in the PEBS record, the LBR bit in MSR_PEBS_DATA_CFG must be enabled. The number of LBR
entries included in the record can be configured in the LBR_entries field of MSR_PEBS_DATA_CFG.
LBR[].INFO [63:0] Other LBR information, like timing. This field is described in more
detail in Section 17.12.1, “MSR_LBR_INFO_x MSR”.
LBR entries are recorded into the record starting at LBR[TOS] and proceeding to LBR[TOS-1] and so on. Note
that the LBR index is modulo the number of LBRs supported on the processor.
18.9.2.3 MSR_PEBS_DATA_CFG
Bits in MSR_PEBS_DATA_CFG can be set to include data field blocks/groups in adaptive records. The Basic Info
group is always included in the record. Additionally, the number of LBR entries included in the record is
configurable.
[Figure: Layout of MSR_PEBS_DATA_CFG (address: 3F2H; scope: thread; reset value: 0) — Memory Info (bit 0), GPRs (bit 1), XMMs (bit 2), LBRs (bit 3), LBR Entries (bits 31:24); other bits reserved]
• Memory Info (bit 0, R/W) — Setting this bit will capture memory information such as the linear address, data
source, and latency of the memory access in the PEBS record.
• GPRs (bit 1, R/W) — Setting this bit will capture the contents of the general-purpose registers in the PEBS
record.
• XMMs (bit 2, R/W) — Setting this bit will capture the contents of the XMM registers in the PEBS record.
• LBRs (bit 3, R/W) — Setting this bit will capture the LBR TO, FROM, and INFO fields in the PEBS record.
• LBR Entries (bits 31:24, R/W) — Set the field to the desired number of entries minus 1. For example, if the
LBR_entries field is 0, a single entry will be included in the record. To include 32 LBR entries, set the
LBR_entries field to 31 (0x1F). To ensure all PEBS records are 16-byte aligned, it is recommended to select an
even number of LBR entries (programmed into LBR_entries as an odd number).
NOTES:
1. A write to the MSR will be ignored when IA32_MISC_ENABLE.PERFMON_AVAILABLE is zero (default).
2. Writing to the reserved bits will cause a GP fault.
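Assembling an MSR_PEBS_DATA_CFG value from the fields above can be sketched as follows (the helper name is ours; the minus-1 encoding of LBR_entries follows the description above):

```python
def pebs_data_cfg(memory_info=False, gprs=False, xmms=False,
                  lbrs=False, lbr_entries=0):
    """Assemble an MSR_PEBS_DATA_CFG value from the fields above.
    lbr_entries is the desired number of LBR entries; the field is
    programmed as that number minus 1."""
    value = (memory_info << 0) | (gprs << 1) | (xmms << 2) | (lbrs << 3)
    if lbrs:
        assert 1 <= lbr_entries <= 32
        value |= (lbr_entries - 1) << 24
    return value

# GPR group plus 32 LBR entries (LBR_entries field = 31, i.e., 0x1F).
assert pebs_data_cfg(gprs=True, lbrs=True, lbr_entries=32) == (0x1F << 24) | 0b1010
```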
0x18 TSC
0x48 RIP
0x50 RAX
... ...
0x88 RDI
0x90 R8
... ...
0xC8 R15
... ...
0x1C0 XMM15
0x1D8 LBR[TOS].TO
0x1E0 LBR[TOS].INFO
... ...
The following example shows the layout of the PEBS record when Basic, GPR, and LBR group with 3 LBR entries are
selected.
0x18 TSC
0x28 RIP
0x30 RAX
... ...
0x68 RDI
0x70 R8
... ...
0xA8 R15
0xB8 LBR[TOS].TO
0xC0 LBR[TOS].INFO
... ...
18.Updates to Chapter 19, Volume 3B
Change bars and green text show changes to Chapter 19 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3B: System Programming Guide, Part 2.
------------------------------------------------------------------------------------------
Changes to this chapter: Minor typo corrections.
NOTE
The event tables listed in this chapter provide information for tool developers to support architectural
and model-specific performance monitoring events. The tables are up to date at processor launch,
but are subject to changes. The most up to date event tables and additional details of performance
event implementation can be found here:
1) In the document titled “Intel® 64 and IA32 Architectures Performance Monitoring Events”,
Document number: 335279, located here:
https://software.intel.com/sites/default/files/managed/8b/6e/335279_performance_monitoring_events_guide.pdf.
These event tables are also available on the web and are located at:
https://download.01.org/perfmon/index/.
2) Performance monitoring event files for Intel processors are hosted by the Intel Open Source
Technology Center. These files can be downloaded here: https://download.01.org/perfmon/.
This chapter lists the performance monitoring events that can be monitored with the Intel 64 or IA-32 processors.
The ability to monitor performance events and the events that can be monitored in these processors are mostly
model-specific, except for architectural performance events, described in Section 19.1.
Model-specific performance events are listed for each generation of microarchitecture:
• Section 19.2 - Processors based on Skylake microarchitecture and processors based on Cascade Lake product
• Section 19.3 - Processors based on Ice Lake microarchitecture
• Section 19.4 - Processors based on Skylake, Kaby Lake and Coffee Lake microarchitectures
• Section 19.5 - Processors based on Knights Landing and Knights Mill microarchitectures
• Section 19.6 - Processors based on Broadwell microarchitecture
• Section 19.7 - Processors based on Haswell microarchitecture
• Section 19.7.1 - Processors based on Haswell-E microarchitecture
• Section 19.8 - Processors based on Ivy Bridge microarchitecture
• Section 19.8.1 - Processors based on Ivy Bridge-E microarchitecture
• Section 19.9 - Processors based on Sandy Bridge microarchitecture
• Section 19.10 - Processors based on Intel® microarchitecture code name Nehalem
• Section 19.11 - Processors based on Intel® microarchitecture code name Westmere
• Section 19.12 - Processors based on Enhanced Intel® Core™ microarchitecture
• Section 19.13 - Processors based on Intel® Core™ microarchitecture
• Section 19.14 - Processors based on the Goldmont Plus microarchitecture
• Section 19.15 - Processors based on the Goldmont microarchitecture
• Section 19.16 - Processors based on the Silvermont microarchitecture
• Section 19.16.1 - Processors based on the Airmont microarchitecture
• Section 19.17 - 45 nm and 32 nm Intel® Atom™ Processors
• Section 19.18 - Intel® Core™ Solo and Intel® Core™ Duo processors
• Section 19.19 - Processors based on Intel NetBurst® microarchitecture
• Section 19.20 - Pentium® M family processors
• Section 19.21 - P6 family processors
• Section 19.22 - Pentium® processors
Vol. 3B 19-1
PERFORMANCE MONITORING EVENTS
NOTE
These performance monitoring events are intended to be used as guides for performance tuning.
The counter values reported by the performance monitoring events are approximate and believed
to be useful as relative guides for tuning software. Known discrepancies are documented where
applicable.
All performance event encodings not documented in the appropriate tables for the given processor
are considered reserved, and their use will result in undefined counter updates with associated
overflow actions.
NOTES:
1. Current implementations count at core crystal clock, TSC, or bus clock frequency.
Table 19-2. Fixed-Function Performance Counter and Pre-defined Performance Events (Contd.)
IA32_PERF_FIXED_CTR1 (address 30AH) — CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.CORE /
CPU_CLK_UNHALTED.THREAD_ANY: The CPU_CLK_UNHALTED.THREAD event counts the number of core cycles
while the logical processor is not in a halt state. If there is only one logical processor in a processor core,
CPU_CLK_UNHALTED.CORE counts the unhalted cycles of the processor core. If there is more than one logical
processor in a processor core, CPU_CLK_UNHALTED.THREAD_ANY is supported by programming
IA32_FIXED_CTR_CTRL[bit 6] AnyThread = 1. The core frequency may change from time to time due to
transitions associated with Enhanced Intel SpeedStep Technology or TM2. For this reason this event may have a
changing ratio with regard to time.
IA32_PERF_FIXED_CTR2 (address 30BH) — CPU_CLK_UNHALTED.REF_TSC: This event counts the number of
reference cycles at the TSC rate when the core is not in a halt state and not in a TM stop-clock state. The core
enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by
core frequency changes (e.g., P states) but counts at the same frequency as the time stamp counter. This event
can approximate elapsed time while the core was not in a halt state and not in a TM stop-clock state.
Table 19-3. Performance Events of the Processor Core Supported in Intel® Xeon® Processor Scalable Family with
Skylake Microarchitecture and Processors Based on the Cascade Lake Product
Event Umask Event Mask Mnemonic Description Comment
Num. Value
00H 01H INST_RETIRED.ANY Counts the number of instructions retired from Fixed Counter
execution. For instructions that consist of multiple
micro-ops, counts the retirement of the last micro-op of
the instruction. Counting continues during hardware
interrupts, traps, and inside interrupt handlers. Notes:
INST_RETIRED.ANY is counted by a designated fixed
counter, leaving the four (eight when Hyperthreading is
disabled) programmable counters available for other
events. INST_RETIRED.ANY_P is counted by a
programmable counter and it is an architectural
performance event. Counting: Faulting executions of
GETSEC/VM entry/VM Exit/MWait will not count as
retired instructions.
00H 02H CPU_CLK_UNHALTED.THREAD Counts the number of core cycles while the thread is Fixed Counter
not in a halt state. The thread enters the halt state
when it is running the HLT instruction. This event is a
component in many key event ratios. The core
frequency may change from time to time due to
transitions associated with Enhanced Intel SpeedStep
Technology or TM2. For this reason this event may
have a changing ratio with regards to time. When the
core frequency is constant, this event can approximate
elapsed time while the core was not in the halt state. It
is counted on a dedicated fixed counter, leaving the
four (eight when Hyperthreading is disabled)
programmable counters available for other events.
00H 02H CPU_CLK_UNHALTED.THREAD_ Core cycles when at least one thread on the physical AnyThread=1
ANY core is not in halt state.
Table 19-3. Performance Events of the Processor Core Supported in Intel® Xeon® Processor Scalable Family with
Skylake Microarchitecture and Processors Based on the Cascade Lake Product (Contd.)
Event Umask Event Mask Mnemonic Description Comment
Num. Value
00H 03H CPU_CLK_UNHALTED.REF_TSC Counts the number of reference cycles when the core is Fixed Counter
not in a halt state. The core enters the halt state when
it is running the HLT instruction or the MWAIT
instruction. This event is not affected by core
frequency changes (for example, P states, TM2
transitions) but has the same incrementing frequency
as the time stamp counter. This event can approximate
elapsed time while the core was not in a halt state. This
event has a constant ratio with the
CPU_CLK_UNHALTED.REF_XCLK event. It is counted on
a dedicated fixed counter, leaving the four (eight when
Hyperthreading is disabled) programmable counters
available for other events. Note: On all current platforms this event stops counting during the duty-off
periods of ‘throttling (TM)’ states, in which the processor is ‘halted’. Because the counter update is done
at a lower clock rate than the core clock, the overflow status bit for this counter may appear ‘sticky’.
After the counter has overflowed, software clears the overflow status bit and resets the counter to less
than MAX. The reset value is not clocked into the counter immediately, so the overflow status bit will flip
high (1) and generate another PMI (if enabled), after which the reset value gets clocked into the counter.
Therefore, software will take the interrupt and read the overflow status bit as 1 for bit 34 while the
counter value is less than MAX. Software should ignore this case.
03H 02H LD_BLOCKS.STORE_FORWARD Counts how many times the load operation got the true
Block-on-Store blocking code preventing store
forwarding. This includes cases when: a. preceding
store conflicts with the load (incomplete overlap), b.
store forwarding is impossible due to u-arch limitations,
c. preceding lock RMW operations are not forwarded, d.
store has the no-forward bit set (uncacheable/page-
split/masked stores), e. all-blocking stores are used
(mostly, fences and port I/O), and others. The most
common case is a load blocked due to its address range
overlapping with a preceding smaller uncompleted
store. Note: This event does not take into account cases
of out-of-SW-control (for example, SbTailHit), unknown
physical STA, and cases of blocking loads on store due
to being non-WB memory type or a lock. These cases
are covered by other events. See the table of not
supported store forwards in the Optimization Guide.
03H 08H LD_BLOCKS.NO_SR The number of times that split load operations are
temporarily blocked because all resources for handling
the split accesses are in use.
07H 01H LD_BLOCKS_PARTIAL.ADDRESS Counts false dependencies in MOB when the partial
_ALIAS comparison upon loose net check and dependency was
resolved by the Enhanced Loose net mechanism. This
may not result in high performance penalties. Loose net
checks can fail when loads and stores are 4k aliased.
Table 19-3. Performance Events of the Processor Core Supported in Intel® Xeon® Processor Scalable Family with
Skylake Microarchitecture and Processors Based on the Cascade Lake Product (Contd.)
Event Umask Event Mask Mnemonic Description Comment
Num. Value
08H 01H DTLB_LOAD_MISSES.MISS_CAUS Counts demand data loads that caused a page walk of
ES_A_WALK any page size (4K/2M/4M/1G). This implies it missed in
all TLB levels, but the walk need not have completed.
08H 02H DTLB_LOAD_MISSES.WALK_COM Counts demand data loads that caused a completed
PLETED_4K page walk (4K page size). This implies it missed in all
TLB levels. The page walk can end with or without a
fault.
08H 04H DTLB_LOAD_MISSES.WALK_COM Counts demand data loads that caused a completed
PLETED_2M_4M page walk (2M and 4M page sizes). This implies it
missed in all TLB levels. The page walk can end with or
without a fault.
08H 08H DTLB_LOAD_MISSES.WALK_COM Counts load misses in all DTLB levels that cause a
PLETED_1G completed page walk (1G page size). The page walk can
end with or without a fault.
08H 0EH DTLB_LOAD_MISSES.WALK_COM Counts demand data loads that caused a completed
PLETED page walk of any page size (4K/2M/4M/1G). This implies
it missed in all TLB levels. The page walk can end with
or without a fault.
08H 10H DTLB_LOAD_MISSES.WALK_PEN Counts 1 per cycle for each PMH that is busy with a
DING page walk for a load. EPT page walk durations are
excluded in Skylake microarchitecture.
08H 10H DTLB_LOAD_MISSES.WALK_ACT Counts cycles when at least one PMH (Page Miss CounterMask=1
IVE Handler) is busy with a page walk for a load.
08H 20H DTLB_LOAD_MISSES.STLB_HIT Counts loads that miss the DTLB (Data TLB) and hit the
STLB (Second level TLB).
0DH 01H INT_MISC.RECOVERY_CYCLES Core cycles the Resource allocator was stalled due to
recovery from an earlier branch misprediction or
machine clear event.
0DH 01H INT_MISC.RECOVERY_CYCLES_A Core cycles the allocator was stalled due to recovery AnyThread=1 AnyT
NY from earlier clear event for any thread running on the
physical core (e.g. misprediction or memory nuke).
0DH 80H INT_MISC.CLEAR_RESTEER_CYC Cycles the issue-stage is waiting for front-end to fetch
LES from resteered path following branch misprediction or
machine clear events.
0EH 01H UOPS_ISSUED.ANY Counts the number of uops that the Resource
Allocation Table (RAT) issues to the Reservation Station
(RS).
0EH 01H UOPS_ISSUED.STALL_CYCLES Counts cycles during which the Resource Allocation CounterMask=1
Table (RAT) does not issue any uops to the reservation Invert=1 CMSK1, INV
station (RS) for the current thread.
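The Comment-column modifiers in this table (CounterMask, Invert, EdgeDetect, AnyThread) correspond to fields of the IA32_PERFEVTSELx MSR; Linux perf uses the same bit layout for raw hardware events on x86. The sketch below, with a helper name of our choosing, packs an event from this table into that layout; it assumes the standard field positions (event select bits 0-7, umask bits 8-15, Edge bit 18, AnyThread bit 21, Invert bit 23, CounterMask bits 24-31) and omits the USR/OS/EN bits.

```python
# Sketch: encode a table entry into the IA32_PERFEVTSELx bit layout.
# Helper name is illustrative, not from the manual; USR/OS/EN bits
# (16, 17, 22) are omitted here.

def perfevtsel(event, umask, cmask=0, inv=False, edge=False, any_thread=False):
    return (event
            | (umask << 8)
            | (int(edge) << 18)
            | (int(any_thread) << 21)
            | (int(inv) << 23)
            | (cmask << 24))

# UOPS_ISSUED.STALL_CYCLES: Event 0EH, Umask 01H, CounterMask=1, Invert=1.
config = perfevtsel(0x0E, 0x01, cmask=1, inv=True)
print(hex(config))  # 0x180010e
```

On Linux, the same event can usually be requested symbolically, e.g. `perf stat -e cpu/event=0x0e,umask=0x01,cmask=1,inv=1/`, which avoids hand-packing the raw value.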
0EH 02H UOPS_ISSUED.VECTOR_WIDTH_ Counts the number of Blend Uops issued by the
MISMATCH Resource Allocation Table (RAT) to the reservation
station (RS) in order to preserve upper bits of vector
registers. Starting with the Skylake microarchitecture,
these Blend uops are needed since every Intel SSE
instruction executed in Dirty Upper State needs to
preserve bits 128-255 of the destination register. For
more information, refer to Mixing Intel AVX and Intel
SSE Code section of the Optimization Guide.
0EH 20H UOPS_ISSUED.SLOW_LEA Number of slow LEA uops being allocated. A uop is
generally considered SlowLea if it has 3 sources (e.g. 2
sources + immediate) regardless if as a result of LEA
instruction or not.
14H 01H ARITH.DIVIDER_ACTIVE Cycles when divide unit is busy executing divide or CounterMask=1
square root operations. Accounts for integer and
floating-point operations.
24H 21H L2_RQSTS.DEMAND_DATA_RD_ Counts the number of demand Data Read requests that
MISS miss L2 cache. Only non-rejected loads are counted.
24H 22H L2_RQSTS.RFO_MISS Counts the RFO (Read-for-Ownership) requests that
miss L2 cache.
24H 24H L2_RQSTS.CODE_RD_MISS Counts L2 cache misses when fetching instructions.
24H 27H L2_RQSTS.ALL_DEMAND_MISS Demand requests that miss L2 cache.
24H 38H L2_RQSTS.PF_MISS Counts requests from the L1/L2/L3 hardware
prefetchers or Load software prefetches that miss L2
cache.
24H 3FH L2_RQSTS.MISS All requests that miss L2 cache.
24H 41H L2_RQSTS.DEMAND_DATA_RD_ Counts the number of demand Data Read requests that
HIT hit L2 cache. Only non-rejected loads are counted.
24H 42H L2_RQSTS.RFO_HIT Counts the RFO (Read-for-Ownership) requests that hit
L2 cache.
24H 44H L2_RQSTS.CODE_RD_HIT Counts L2 cache hits when fetching instructions, code
reads.
24H D8H L2_RQSTS.PF_HIT Counts requests from the L1/L2/L3 hardware
prefetchers or Load software prefetches that hit L2
cache.
24H E1H L2_RQSTS.ALL_DEMAND_DATA Counts the number of demand Data Read requests
_RD (including requests from L1D hardware prefetchers).
These loads may hit or miss L2 cache. Only non-rejected
loads are counted.
24H E2H L2_RQSTS.ALL_RFO Counts the total number of RFO (read for ownership)
requests to L2 cache. L2 RFO requests include both
L1D demand RFO misses as well as L1D RFO
prefetches.
24H E4H L2_RQSTS.ALL_CODE_RD Counts the total number of L2 code requests.
24H E7H L2_RQSTS.ALL_DEMAND_REFE Demand requests to L2 cache.
RENCES
24H F8H L2_RQSTS.ALL_PF Counts the total number of requests from the L2
hardware prefetchers.
24H FFH L2_RQSTS.REFERENCES All L2 requests.
28H 07H CORE_POWER.LVL0_TURBO_LIC Core cycles where the core was running with power-
ENSE delivery for baseline license level 0. This includes non-
AVX codes, SSE, AVX 128-bit, and low-current AVX
256-bit codes.
28H 18H CORE_POWER.LVL1_TURBO_LIC Core cycles where the core was running with power-
ENSE delivery for license level 1. This includes high current
AVX 256-bit instructions as well as low current AVX
512-bit instructions.
28H 20H CORE_POWER.LVL2_TURBO_LIC Core cycles where the core was running with power-
ENSE delivery for license level 2 (introduced in Skylake
Server microarchitecture). This includes high current
AVX 512-bit instructions.
28H 40H CORE_POWER.THROTTLE Core cycles the out-of-order engine was throttled due
to a pending power level request.
2EH 41H LONGEST_LAT_CACHE.MISS Counts core-originated cacheable requests that miss See Table 19-1.
the L3 cache (Longest Latency cache). Requests include
data and code reads, Reads-for-Ownership (RFOs),
speculative accesses and hardware prefetches from L1
and L2. It does not include all misses to the L3.
2EH 4FH LONGEST_LAT_CACHE.REFEREN Counts core-originated cacheable requests to the L3 See Table 19-1.
CE cache (Longest Latency cache). Requests include data
and code reads, Reads-for-Ownership (RFOs),
speculative accesses and hardware prefetches from L1
and L2. It does not include all accesses to the L3.
3CH 00H CPU_CLK_UNHALTED.THREAD_ This is an architectural event that counts the number of See Table 19-1.
P thread cycles while the thread is not in a halt state. The
thread enters the halt state when it is running the HLT
instruction. The core frequency may change from time
to time due to power or thermal throttling. For this
reason, this event may have a changing ratio with
regards to wall clock time.
3CH 00H CPU_CLK_UNHALTED.THREAD_ Core cycles when at least one thread on the physical AnyThread=1 AnyT
P_ANY core is not in halt state.
3CH 00H CPU_CLK_UNHALTED.RING0_TR Counts when the Current Privilege Level (CPL) EdgeDetect=1
ANS transitions from ring 1, 2 or 3 to ring 0 (Kernel). CounterMask=1
3CH 01H CPU_CLK_THREAD_UNHALTED. Core crystal clock cycles when the thread is unhalted. See Table 19-1.
REF_XCLK
3CH 01H CPU_CLK_THREAD_UNHALTED. Core crystal clock cycles when at least one thread on AnyThread=1 AnyT
REF_XCLK_ANY the physical core is unhalted.
3CH 01H CPU_CLK_UNHALTED.REF_XCLK Core crystal clock cycles when the thread is unhalted. See Table 19-1.
3CH 01H CPU_CLK_UNHALTED.REF_XCLK Core crystal clock cycles when at least one thread on AnyThread=1 AnyT
_ANY the physical core is unhalted.
3CH 02H CPU_CLK_THREAD_UNHALTED. Core crystal clock cycles when this thread is unhalted
ONE_THREAD_ACTIVE and the other thread is halted.
3CH 02H CPU_CLK_UNHALTED.ONE_THR Core crystal clock cycles when this thread is unhalted
EAD_ACTIVE and the other thread is halted.
48H 01H L1D_PEND_MISS.PENDING Counts the duration of L1D misses outstanding, that is,
each cycle the number of Fill Buffers (FB) outstanding
that are required by Demand Reads. An FB is either held
by demand loads, or held by non-demand loads and hit
at least once by demand. The valid outstanding interval
starts at FB allocation (if the FB is allocated by
demand) or at the demand hit of the FB (if it is
allocated by hardware or software prefetch), and ends
at FB deallocation. Note: In the L1D, a Demand Read
includes cacheable or noncacheable demand loads,
including ones causing cache-line splits and reads due
to page walks resulting from any request type.
48H 01H L1D_PEND_MISS.PENDING_CYCL Counts duration of L1D miss outstanding in cycles. CounterMask=1
ES CMSK1
48H 01H L1D_PEND_MISS.PENDING_CYCL Cycles with L1D load Misses outstanding from any CounterMask=1
ES_ANY thread on physical core. AnyThread=1
CMSK1, AnyT
48H 02H L1D_PEND_MISS.FB_FULL Number of times a request needed a FB (Fill Buffer)
entry but there was no entry available for it. A request
includes cacheable/uncacheable demands that are load,
store or SW prefetch instructions.
49H 01H DTLB_STORE_MISSES.MISS_CAU Counts demand data stores that caused a page walk of
SES_A_WALK any page size (4K/2M/4M/1G). This implies it missed in
all TLB levels, but the walk need not have completed.
49H 02H DTLB_STORE_MISSES.WALK_CO Counts demand data stores that caused a completed
MPLETED_4K page walk (4K page size). This implies it missed in all
TLB levels. The page walk can end with or without a
fault.
49H 04H DTLB_STORE_MISSES.WALK_CO Counts demand data stores that caused a completed
MPLETED_2M_4M page walk (2M and 4M page sizes). This implies it
missed in all TLB levels. The page walk can end with or
without a fault.
49H 08H DTLB_STORE_MISSES.WALK_CO Counts store misses in all DTLB levels that cause a
MPLETED_1G completed page walk (1G page size). The page walk can
end with or without a fault.
49H 0EH DTLB_STORE_MISSES.WALK_CO Counts demand data stores that caused a completed
MPLETED page walk of any page size (4K/2M/4M/1G). This implies
it missed in all TLB levels. The page walk can end with
or without a fault.
49H 10H DTLB_STORE_MISSES.WALK_PE Counts 1 per cycle for each PMH that is busy with a
NDING page walk for a store. EPT page walk durations are
excluded in Skylake microarchitecture.
49H 10H DTLB_STORE_MISSES.WALK_AC Counts cycles when at least one PMH (Page Miss CounterMask=1
TIVE Handler) is busy with a page walk for a store. CMSK1
49H 20H DTLB_STORE_MISSES.STLB_HIT Stores that miss the DTLB (Data TLB) and hit the STLB
(2nd Level TLB).
4CH 01H LOAD_HIT_PRE.SW_PF Counts all load dispatches, other than software
prefetches, that hit the fill buffer (FB) allocated for a
software prefetch. It can also be incremented by some
lock instructions, so it should only be used with
profiling, so that the locks can be excluded by ASM
(assembly) inspection of the nearby instructions.
4FH 10H EPT.WALK_PENDING Counts cycles for each PMH (Page Miss Handler) that is
busy with an EPT (Extended Page Table) walk for any
request type.
51H 01H L1D.REPLACEMENT Counts L1D data line replacements including
opportunistic replacements, and replacements that
require stall-for-replace or block-for-replace.
54H 01H TX_MEM.ABORT_CONFLICT Number of times a TSX line had a cache conflict.
54H 02H TX_MEM.ABORT_CAPACITY Number of times a transactional abort was signaled due
to a data capacity limitation for transactional reads or
writes.
54H 04H TX_MEM.ABORT_HLE_STORE_T Number of times a TSX Abort was triggered due to a
O_ELIDED_LOCK non-release/commit store to lock.
54H 08H TX_MEM.ABORT_HLE_ELISION_ Number of times a TSX Abort was triggered due to
BUFFER_NOT_EMPTY commit but Lock Buffer not empty.
54H 10H TX_MEM.ABORT_HLE_ELISION_ Number of times a TSX Abort was triggered due to
BUFFER_MISMATCH release/commit but data and address mismatch.
54H 20H TX_MEM.ABORT_HLE_ELISION_ Number of times a TSX Abort was triggered due to
BUFFER_UNSUPPORTED_ALIGN attempting an unsupported alignment from Lock
MENT Buffer.
54H 40H TX_MEM.HLE_ELISION_BUFFER Number of times we could not allocate Lock Buffer.
_FULL
5DH 01H TX_EXEC.MISC1 Unfriendly TSX abort triggered by a flowmarker.
5DH 02H TX_EXEC.MISC2 Unfriendly TSX abort triggered by a vzeroupper
instruction.
5DH 04H TX_EXEC.MISC3 Unfriendly TSX abort triggered by a nest count that is
too deep.
5DH 08H TX_EXEC.MISC4 RTM region detected inside HLE.
5DH 10H TX_EXEC.MISC5 Counts the number of times an HLE XACQUIRE
instruction was executed inside an RTM transactional
region.
5EH 01H RS_EVENTS.EMPTY_CYCLES Counts cycles during which the reservation station (RS)
is empty for the thread. Note: In ST-mode, the inactive
thread should drive 0. This is usually caused by
severely costly branch mispredictions, or allocator/FE
issues.
5EH 01H RS_EVENTS.EMPTY_END Counts end of periods where the Reservation Station EdgeDetect=1
(RS) was empty. Could be useful to precisely locate CounterMask=1
front-end Latency Bound issues. Invert=1 CMSK1, INV
60H 01H OFFCORE_REQUESTS_OUTSTAN Counts the number of offcore outstanding Demand
DING.DEMAND_DATA_RD Data Read transactions in the super queue (SQ) every
cycle. A transaction is considered to be in the Offcore
outstanding state between L2 miss and transaction
completion sent to requestor. See the corresponding
Umask under OFFCORE_REQUESTS. Note: A prefetch
promoted to Demand is counted from the promotion
point.
60H 01H OFFCORE_REQUESTS_OUTSTAN Counts cycles when offcore outstanding Demand Data CounterMask=1
DING.CYCLES_WITH_DEMAND_D Read transactions are present in the super queue (SQ). CMSK1
ATA_RD A transaction is considered to be in the Offcore
outstanding state between L2 miss and transaction
completion sent to requestor (SQ de-allocation).
60H 01H OFFCORE_REQUESTS_OUTSTAN Cycles with at least 6 offcore outstanding Demand Data CounterMask=6
DING.DEMAND_DATA_RD_GE_6 Read transactions in uncore queue. CMSK6
60H 02H OFFCORE_REQUESTS_OUTSTAN Counts the number of offcore outstanding Code Reads CounterMask=1
DING.DEMAND_CODE_RD transactions in the super queue every cycle. The CMSK1
'Offcore outstanding' state of the transaction lasts
from the L2 miss until the sending transaction
completion to requestor (SQ deallocation). See the
corresponding Umask under OFFCORE_REQUESTS.
60H 02H OFFCORE_REQUESTS_OUTSTAN Counts the number of offcore outstanding Code Reads CMSK1
DING.CYCLES_WITH_DEMAND_C transactions in the super queue every cycle. The
ODE_RD 'Offcore outstanding' state of the transaction lasts
from the L2 miss until the sending transaction
completion to requestor (SQ deallocation). See the
corresponding Umask under OFFCORE_REQUESTS.
60H 04H OFFCORE_REQUESTS_OUTSTAN Counts the number of offcore outstanding RFO (store) CounterMask=1
DING.DEMAND_RFO transactions in the super queue (SQ) every cycle. A CMSK1
transaction is considered to be in the Offcore
outstanding state between L2 miss and transaction
completion sent to requestor (SQ de-allocation). See
corresponding Umask under OFFCORE_REQUESTS.
60H 04H OFFCORE_REQUESTS_OUTSTAN Counts the number of offcore outstanding demand RFO CMSK1
DING.CYCLES_WITH_DEMAND_R transactions in the super queue every cycle. The
FO 'Offcore outstanding' state of the transaction lasts
from the L2 miss until the sending transaction
completion to requestor (SQ deallocation). See the
corresponding Umask under OFFCORE_REQUESTS.
60H 08H OFFCORE_REQUESTS_OUTSTAN Counts the number of offcore outstanding cacheable
DING.ALL_DATA_RD Core Data Read transactions in the super queue every
cycle. A transaction is considered to be in the Offcore
outstanding state between L2 miss and transaction
completion sent to requestor (SQ de-allocation). See
corresponding Umask under OFFCORE_REQUESTS.
60H 08H OFFCORE_REQUESTS_OUTSTAN Counts cycles when offcore outstanding cacheable Core CounterMask=1
DING.CYCLES_WITH_DATA_RD Data Read transactions are present in the super queue. CMSK1
A transaction is considered to be in the Offcore
outstanding state between L2 miss and transaction
completion sent to requestor (SQ de-allocation). See
corresponding Umask under OFFCORE_REQUESTS.
60H 10H OFFCORE_REQUESTS_OUTSTAN Counts number of Offcore outstanding Demand Data
DING.L3_MISS_DEMAND_DATA_ Read requests that miss L3 cache in the superQ every
RD cycle.
60H 10H OFFCORE_REQUESTS_OUTSTAN Cycles with at least 1 Demand Data Read request that CounterMask=1
DING.CYCLES_WITH_L3_MISS_D misses L3 cache in the superQ. CMSK1
EMAND_DATA_RD
60H 10H OFFCORE_REQUESTS_OUTSTAN Cycles with at least 6 Demand Data Read requests that CounterMask=6
DING.L3_MISS_DEMAND_DATA_ miss L3 cache in the superQ. CMSK6
RD_GE_6
79H 04H IDQ.MITE_UOPS Counts the number of uops delivered to Instruction
Decode Queue (IDQ) from the MITE path. Counting
includes uops that may 'bypass' the IDQ. This also
means that uops are not being delivered from the
Decode Stream Buffer (DSB).
79H 04H IDQ.MITE_CYCLES Counts cycles during which uops are being delivered to CounterMask=1
Instruction Decode Queue (IDQ) from the MITE path. CMSK1
Counting includes uops that may 'bypass' the IDQ.
79H 08H IDQ.DSB_UOPS Counts the number of uops delivered to Instruction
Decode Queue (IDQ) from the Decode Stream Buffer
(DSB) path. Counting includes uops that may ‘bypass’
the IDQ.
79H 08H IDQ.DSB_CYCLES Counts cycles during which uops are being delivered to CounterMask=1
Instruction Decode Queue (IDQ) from the Decode CMSK1
Stream Buffer (DSB) path. Counting includes uops that
may 'bypass' the IDQ.
79H 10H IDQ.MS_DSB_CYCLES Counts cycles during which uops initiated by Decode CounterMask=1
Stream Buffer (DSB) are being delivered to Instruction
Decode Queue (IDQ) while the Microcode Sequencer
(MS) is busy. Counting includes uops that may 'bypass'
the IDQ.
79H 18H IDQ.ALL_DSB_CYCLES_4_UOPS Counts the number of cycles 4 uops were delivered to CounterMask=4
Instruction Decode Queue (IDQ) from the Decode CMSK4
Stream Buffer (DSB) path. Count includes uops that may
'bypass' the IDQ.
79H 18H IDQ.ALL_DSB_CYCLES_ANY_UO Counts the number of cycles uops were delivered to CounterMask=1
PS Instruction Decode Queue (IDQ) from the Decode CMSK1
Stream Buffer (DSB) path. Count includes uops that may
'bypass' the IDQ.
79H 20H IDQ.MS_MITE_UOPS Counts the number of uops initiated by MITE and
delivered to Instruction Decode Queue (IDQ) while the
Microcode Sequencer (MS) is busy. Counting includes
uops that may 'bypass' the IDQ.
79H 24H IDQ.ALL_MITE_CYCLES_4_UOPS Counts the number of cycles 4 uops were delivered to CounterMask=4
the Instruction Decode Queue (IDQ) from the MITE CMSK4
(legacy decode pipeline) path. Counting includes uops
that may 'bypass' the IDQ. During these cycles uops are
not being delivered from the Decode Stream Buffer
(DSB).
79H 24H IDQ.ALL_MITE_CYCLES_ANY_UO Counts the number of cycles uops were delivered to CounterMask=1
PS the Instruction Decode Queue (IDQ) from the MITE CMSK1
(legacy decode pipeline) path. Counting includes uops
that may 'bypass' the IDQ. During these cycles uops are
not being delivered from the Decode Stream Buffer
(DSB).
79H 30H IDQ.MS_CYCLES Counts cycles during which uops are being delivered to CounterMask=1
Instruction Decode Queue (IDQ) while the Microcode CMSK1
Sequencer (MS) is busy. Counting includes uops that
may 'bypass' the IDQ. Uops may be initiated by Decode
Stream Buffer (DSB) or MITE.
79H 30H IDQ.MS_SWITCHES Number of switches from DSB (Decode Stream Buffer) EdgeDetect=1
or MITE (legacy decode pipeline) to the Microcode CounterMask=1
Sequencer. EDGE
79H 30H IDQ.MS_UOPS Counts the total number of uops delivered by the
Microcode Sequencer (MS). Any instruction over 4 uops
will be delivered by the MS. Some instructions such as
transcendentals may additionally generate uops from
the MS.
80H 04H ICACHE_16B.IFDATA_STALL Cycles where a code line fetch is stalled due to an L1
instruction cache miss. The legacy decode pipeline
works at a 16 Byte granularity.
83H 01H ICACHE_64B.IFTAG_HIT Instruction fetch tag lookups that hit in the instruction
cache (L1I). Counts at 64-byte cache-line granularity.
83H 02H ICACHE_64B.IFTAG_MISS Instruction fetch tag lookups that miss in the
instruction cache (L1I). Counts at 64-byte cache-line
granularity.
83H 04H ICACHE_64B.IFTAG_STALL Cycles where a code fetch is stalled due to L1
instruction cache tag miss.
85H 01H ITLB_MISSES.MISS_CAUSES_A_ Counts page walks of any page size (4K/2M/4M/1G)
WALK caused by a code fetch. This implies it missed in the
ITLB and further levels of TLB, but the walk need not
have completed.
85H 02H ITLB_MISSES.WALK_COMPLETE Counts completed page walks (4K page size) caused by
D_4K a code fetch. This implies it missed in the ITLB and
further levels of TLB. The page walk can end with or
without a fault.
85H 04H ITLB_MISSES.WALK_COMPLETE Counts completed page walks (2M and 4M page sizes)
D_2M_4M caused by a code fetch. This implies it missed in the
ITLB and further levels of TLB. The page walk can end
with or without a fault.
85H 08H ITLB_MISSES.WALK_COMPLETE Counts completed page walks (1G page size) caused by
D_1G a code fetch. This implies it missed in the ITLB and
further levels of TLB. The page walk can end with or
without a fault.
85H 0EH ITLB_MISSES.WALK_COMPLETE Counts completed page walks of any page size
D (4K/2M/4M/1G) caused by a code fetch. This implies it
missed in the ITLB and further levels of TLB. The page
walk can end with or without a fault.
85H 10H ITLB_MISSES.WALK_PENDING Counts 1 per cycle for each PMH that is busy with a
page walk for an instruction fetch request. EPT page
walk durations are excluded in Skylake
microarchitecture.
85H 10H ITLB_MISSES.WALK_ACTIVE Cycles when at least one PMH is busy with a page walk CounterMask=1
for a code (instruction fetch) request. EPT page walk
durations are excluded in Skylake microarchitecture.
85H 20H ITLB_MISSES.STLB_HIT Instruction fetch requests that miss the ITLB and hit
the STLB.
87H 01H ILD_STALL.LCP Counts cycles in which Instruction Length Decoder (ILD)
stalls occurred due to a dynamically changing prefix
length of the decoded instruction (operand-size prefix
0x66, address-size prefix 0x67, or REX.W for Intel 64).
Count is proportional to the number of prefixes in a
16-byte line. This may result in a three-cycle penalty
for each LCP (Length Changing Prefix) in a 16-byte
chunk.
9CH 01H IDQ_UOPS_NOT_DELIVERED.CO Counts the number of uops not delivered to Resource
RE Allocation Table (RAT) per thread, adding “4 – x” when
Resource Allocation Table (RAT) is not stalled and
Instruction Decode Queue (IDQ) delivers x uops to
Resource Allocation Table (RAT) (where x belongs to
{0,1,2,3}). Counting does not cover cases when: a. IDQ-
Resource Allocation Table (RAT) pipe serves the other
thread. b. Resource Allocation Table (RAT) is stalled for
the thread (including uop drops and clear BE
conditions). c. Instruction Decode Queue (IDQ) delivers
four uops.
9CH 01H IDQ_UOPS_NOT_DELIVERED.CYC Counts, on the per-thread basis, cycles when no uops CounterMask=4
LES_0_UOPS_DELIV.CORE are delivered to Resource Allocation Table (RAT). CMSK4
IDQ_Uops_Not_Delivered.core =4.
9CH 01H IDQ_UOPS_NOT_DELIVERED.CYC Counts, on the per-thread basis, cycles when less than CounterMask=3
LES_LE_1_UOP_DELIV.CORE 1 uop is delivered to Resource Allocation Table (RAT). CMSK3
IDQ_Uops_Not_Delivered.core >= 3.
9CH 01H IDQ_UOPS_NOT_DELIVERED.CYC Cycles with less than 2 uops delivered by the front end. CounterMask=2
LES_LE_2_UOP_DELIV.CORE CMSK2
9CH 01H IDQ_UOPS_NOT_DELIVERED.CYC Cycles with less than 3 uops delivered by the front end. CounterMask=1
LES_LE_3_UOP_DELIV.CORE CMSK1
9CH 01H IDQ_UOPS_NOT_DELIVERED.CYC Counts cycles FE delivered 4 uops or Resource CounterMask=1
LES_FE_WAS_OK Allocation Table (RAT) was stalling FE. Invert=1 CMSK, INV
A1H 01H UOPS_DISPATCHED_PORT.PORT Counts, on the per-thread basis, cycles during which at
_0 least one uop is dispatched from the Reservation
Station (RS) to port 0.
A1H 02H UOPS_DISPATCHED_PORT.PORT Counts, on the per-thread basis, cycles during which at
_1 least one uop is dispatched from the Reservation
Station (RS) to port 1.
A1H 04H UOPS_DISPATCHED_PORT.PORT Counts, on the per-thread basis, cycles during which at
_2 least one uop is dispatched from the Reservation
Station (RS) to port 2.
A1H 08H UOPS_DISPATCHED_PORT.PORT Counts, on the per-thread basis, cycles during which at
_3 least one uop is dispatched from the Reservation
Station (RS) to port 3.
A1H 10H UOPS_DISPATCHED_PORT.PORT Counts, on the per-thread basis, cycles during which at
_4 least one uop is dispatched from the Reservation
Station (RS) to port 4.
A1H 20H UOPS_DISPATCHED_PORT.PORT Counts, on the per-thread basis, cycles during which at
_5 least one uop is dispatched from the Reservation
Station (RS) to port 5.
A1H 40H UOPS_DISPATCHED_PORT.PORT Counts, on the per-thread basis, cycles during which at
_6 least one uop is dispatched from the Reservation
Station (RS) to port 6.
A1H 80H UOPS_DISPATCHED_PORT.PORT Counts, on the per-thread basis, cycles during which at
_7 least one uop is dispatched from the Reservation
Station (RS) to port 7.
A2H 01H RESOURCE_STALLS.ANY Counts resource-related stall cycles. Reasons for stalls
can be as follows: a. *any* u-arch structure got full (LB,
SB, RS, ROB, BOB, LM, Physical Register Reclaim Table
(PRRT), or Physical History Table (PHT) slots). b. *any*
u-arch structure got empty (like INT/SIMD FreeLists). c.
FPU control word (FPCW), MXCSR, and others. This
counts cycles that the pipeline back end blocked uop
delivery from the front end.
A2H 08H RESOURCE_STALLS.SB Counts allocation stall cycles caused by the store buffer
(SB) being full. This counts cycles that the pipeline back
end blocked uop delivery from the front end.
A3H 01H CYCLE_ACTIVITY.CYCLES_L2_MI Cycles while L2 cache miss demand load is outstanding. CounterMask=1
SS CMSK1
A3H 02H CYCLE_ACTIVITY.CYCLES_L3_MI Cycles while L3 cache miss demand load is outstanding. CounterMask=2
SS CMSK2
A3H 04H CYCLE_ACTIVITY.STALLS_TOTAL Total execution stalls. CounterMask=4
CMSK4
A3H 05H CYCLE_ACTIVITY.STALLS_L2_MI Execution stalls while L2 cache miss demand load is CounterMask=5
SS outstanding. CMSK5
A3H 06H CYCLE_ACTIVITY.STALLS_L3_MI Execution stalls while L3 cache miss demand load is CounterMask=6
SS outstanding. CMSK6
A3H 08H CYCLE_ACTIVITY.CYCLES_L1D_M Cycles while L1 cache miss demand load is outstanding. CounterMask=8
ISS CMSK8
A3H 0CH CYCLE_ACTIVITY.STALLS_L1D_M Execution stalls while L1 cache miss demand load is CounterMask=12
ISS outstanding. CMSK12
A3H 10H CYCLE_ACTIVITY.CYCLES_MEM_ Cycles while memory subsystem has an outstanding CounterMask=16
ANY load. CMSK16
A3H 14H CYCLE_ACTIVITY.STALLS_MEM_ Execution stalls while memory subsystem has an CounterMask=20
ANY outstanding load. CMSK20
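The CYCLE_ACTIVITY.STALLS_* events above pair naturally with total thread cycles (CPU_CLK_UNHALTED.THREAD_P) to estimate how much of execution time was stalled on memory. A rough sketch, with illustrative sample counts:

```python
# Sketch: fraction of unhalted cycles spent in execution stalls with an
# outstanding load, from CYCLE_ACTIVITY.STALLS_MEM_ANY and
# CPU_CLK_UNHALTED.THREAD_P. Sample counts are illustrative only.

def mem_stall_fraction(stalls_mem_any, unhalted_cycles):
    return stalls_mem_any / unhalted_cycles

print(mem_stall_fraction(300_000, 1_200_000))  # 0.25
```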
A6H 01H EXE_ACTIVITY.EXE_BOUND_0_P Counts cycles during which no uops were executed on
ORTS all ports and Reservation Station (RS) was not empty.
A6H 02H EXE_ACTIVITY.1_PORTS_UTIL Counts cycles during which a total of 1 uop was
executed on all ports and Reservation Station (RS) was
not empty.
A6H 04H EXE_ACTIVITY.2_PORTS_UTIL Counts cycles during which a total of 2 uops were
executed on all ports and Reservation Station (RS) was
not empty.
A6H 08H EXE_ACTIVITY.3_PORTS_UTIL Counts cycles during which a total of 3 uops were
executed on all ports and Reservation Station (RS) was
not empty.
A6H 10H EXE_ACTIVITY.4_PORTS_UTIL Counts cycles during which a total of 4 uops were
executed on all ports and Reservation Station (RS) was
not empty.
A6H 40H EXE_ACTIVITY.BOUND_ON_STO Cycles where the Store Buffer was full and no
RES outstanding load.
A8H 01H LSD.UOPS Number of uops delivered to the back-end by the LSD
(Loop Stream Detector).
A8H 01H LSD.CYCLES_ACTIVE Counts the cycles when at least one uop is delivered by CounterMask=1
the LSD (Loop-stream detector). CMSK1
A8H 01H LSD.CYCLES_4_UOPS Counts the cycles when 4 uops are delivered by the CounterMask=4
LSD (Loop-stream detector). CMSK4
ABH 02H DSB2MITE_SWITCHES.PENALTY Counts Decode Stream Buffer (DSB)-to-MITE switch
_CYCLES true penalty cycles. These cycles do not include uops
routed through because of the switch itself, for
example, when Instruction Decode Queue (IDQ) pre-
allocation is unavailable, or Instruction Decode Queue
(IDQ) is full. DSB-to-MITE switch true penalty cycles
happen after the merge mux (MM) receives Decode
Stream Buffer (DSB) Sync-indication until receiving the
first MITE uop. MM is placed before Instruction Decode
Queue (IDQ) to merge uops being fed from the MITE
and Decode Stream Buffer (DSB) paths. Decode Stream
Buffer (DSB) inserts the Sync-indication whenever a
Decode Stream Buffer (DSB)-to-MITE switch
occurs.Penalty: A Decode Stream Buffer (DSB) hit
followed by a Decode Stream Buffer (DSB) miss can
cost up to six cycles in which no uops are delivered to
the IDQ. Most often, such switches from the Decode
Stream Buffer (DSB) to the legacy pipeline cost 0 to 2
cycles.
AEH 01H ITLB.ITLB_FLUSH Counts the number of flushes of the big or small ITLB
pages. Counting includes both TLB Flush (covering all
sets) and TLB Set Clear (set-specific).
B0H 01H OFFCORE_REQUESTS.DEMAND_ Counts the Demand Data Read requests sent to uncore.
DATA_RD Use it in conjunction with
OFFCORE_REQUESTS_OUTSTANDING to determine
average latency in the uncore.
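The description above suggests dividing the occupancy count by the request count to estimate average uncore latency. A minimal sketch of that arithmetic, with illustrative sample counts:

```python
# Sketch: average demand-read latency in core cycles, per the
# description above: OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD
# (occupancy, counted each cycle) divided by
# OFFCORE_REQUESTS.DEMAND_DATA_RD (requests). Sample counts are
# illustrative only.

def avg_demand_read_latency(outstanding_cycles, requests):
    return outstanding_cycles / requests

print(avg_demand_read_latency(1_200_000, 15_000))  # 80.0
```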
B0H 02H OFFCORE_REQUESTS.DEMAND_ Counts both cacheable and non-cacheable code read
CODE_RD requests.
B0H 04H OFFCORE_REQUESTS.DEMAND_ Counts the demand RFO (read for ownership) requests
RFO including regular RFOs, locks, ItoM.
B0H 08H OFFCORE_REQUESTS.ALL_DATA Counts the demand and prefetch data reads. All Core
_RD Data Reads include cacheable 'Demands' and L2
prefetchers (not L3 prefetchers). Counting also covers
reads due to page walks resulting from any request
type.
B0H 10H OFFCORE_REQUESTS.L3_MISS_ Demand Data Read requests that miss L3 cache.
DEMAND_DATA_RD
B0H 80H OFFCORE_REQUESTS.ALL_REQU Counts memory transactions that reached the super
ESTS queue, including requests initiated by the core, all L3
prefetches, page walks, etc.
B1H 01H UOPS_EXECUTED.THREAD Number of uops to be executed per-thread each cycle.
B1H 01H UOPS_EXECUTED.STALL_CYCLE Counts cycles during which no uops were dispatched CounterMask=1
S from the Reservation Station (RS) per thread. Invert=1 CMSK, INV
B1H 01H UOPS_EXECUTED.CYCLES_GE_1 Cycles where at least 1 uop was executed per-thread. CounterMask=1
_UOP_EXEC CMSK1
B1H 01H UOPS_EXECUTED.CYCLES_GE_2 Cycles where at least 2 uops were executed per-thread. CounterMask=2
_UOPS_EXEC CMSK2
B1H 01H UOPS_EXECUTED.CYCLES_GE_3 Cycles where at least 3 uops were executed per-thread. CounterMask=3
_UOPS_EXEC CMSK3
B1H 01H UOPS_EXECUTED.CYCLES_GE_4 Cycles where at least 4 uops were executed per-thread. CounterMask=4
_UOPS_EXEC CMSK4
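Because each UOPS_EXECUTED.CYCLES_GE_N count is cumulative (cycles with at least N uops executed), the number of cycles in which exactly N uops executed can be derived by subtracting adjacent counts. A minimal sketch, assuming the four GE counts were already collected (the sample values are hypothetical):

```python
# Sketch: derive "cycles with exactly N uops executed" from the cumulative
# UOPS_EXECUTED.CYCLES_GE_1..4 counts.
def exact_uop_cycles(ge: list[int]) -> list[int]:
    """ge[i] = cycles with at least i+1 uops executed (GE_1..GE_4).
    Returns cycles with exactly 1, 2, 3 uops; the last bucket is the
    GE_4 count itself, i.e., '4 or more'."""
    return [ge[i] - ge[i + 1] for i in range(len(ge) - 1)] + [ge[-1]]

print(exact_uop_cycles([1000, 700, 400, 150]))  # [300, 300, 250, 150]
```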
B1H 02H UOPS_EXECUTED.CORE: Number of uops executed from any thread.
B1H 02H UOPS_EXECUTED.CORE_CYCLES_GE_1: Cycles where at least 1 micro-op is executed from any thread on the physical core. (Comment: CounterMask=1; CMSK1.)
B1H 02H UOPS_EXECUTED.CORE_CYCLES_GE_2: Cycles where at least 2 micro-ops are executed from any thread on the physical core. (Comment: CounterMask=2; CMSK2.)
B1H 02H UOPS_EXECUTED.CORE_CYCLES_GE_3: Cycles where at least 3 micro-ops are executed from any thread on the physical core. (Comment: CounterMask=3; CMSK3.)
B1H 02H UOPS_EXECUTED.CORE_CYCLES_GE_4: Cycles where at least 4 micro-ops are executed from any thread on the physical core. (Comment: CounterMask=4; CMSK4.)
B1H 02H UOPS_EXECUTED.CORE_CYCLES_NONE: Cycles with no micro-ops executed from any thread on the physical core. (Comment: CounterMask=1, Invert=1; CMSK1, INV.)
B1H 10H UOPS_EXECUTED.X87: Counts the number of x87 uops executed.
B2H 01H OFFCORE_REQUESTS_BUFFER.SQ_FULL: Counts the number of cases when the offcore requests buffer cannot take more entries for the core. This can happen when the super queue does not contain eligible entries, or when the L1D writeback pending FIFO is full. Note: the writeback pending FIFO has six entries.
BDH 01H TLB_FLUSH.DTLB_THREAD: Counts the number of DTLB flush attempts of the thread-specific entries.
BDH 20H TLB_FLUSH.STLB_ANY: Counts the number of any STLB flush attempts (such as entire, VPID, PCID, InvPage, CR3 write, etc.).
C0H 00H INST_RETIRED.ANY_P: Counts the number of instructions (EOMs) retired. Counting covers macro-fused instructions individually (that is, increments by two). (Comment: See Table 19-1.)
C0H 01H INST_RETIRED.PREC_DIST: A version of INST_RETIRED that allows for a more unbiased distribution of samples across retired instructions. It utilizes the Precise Distribution of Instructions Retired (PDIR) feature to mitigate some bias in how retired instructions get sampled. (Comment: Precise event capable. Requires PEBS on General Counter 1 (PDIR).)
C1H 3FH OTHER_ASSISTS.ANY: Number of times a microcode assist is invoked by hardware other than FP assists. Examples include AD (page Access Dirty) and AVX*-related assists.
C2H 01H UOPS_RETIRED.STALL_CYCLES: This is a non-precise version (that is, does not use PEBS) of the event that counts cycles without actually retired uops. (Comment: CounterMask=1, Invert=1; CMSK1, INV.)
C2H 01H UOPS_RETIRED.TOTAL_CYCLES: Number of cycles using an always-true condition (uops_ret < 16) applied to the non-PEBS uops retired event. (Comment: CounterMask=10, Invert=1; CMSK10, INV.)
C2H 02H UOPS_RETIRED.RETIRE_SLOTS: Counts the retirement slots used.
C3H 01H MACHINE_CLEARS.COUNT: Number of machine clears (nukes) of any type. (Comment: EdgeDetect=1, CounterMask=1; CMSK1, EDG.)
C3H 02H MACHINE_CLEARS.MEMORY_ORDERING: Counts the number of memory ordering machine clears detected. Memory ordering machine clears can result from one of the following: a. memory disambiguation, b. an external snoop, or c. a cross-SMT-HW-thread snoop (stores) hitting a load buffer.
C3H 04H MACHINE_CLEARS.SMC: Counts self-modifying code (SMC) detected, which causes a machine clear.
C4H 00H BR_INST_RETIRED.ALL_BRANCHES: Counts all (macro) branch instructions retired. (Comment: Precise event capable. See Table 19-1.)
C4H 01H BR_INST_RETIRED.CONDITIONAL: This is a non-precise version (that is, does not use PEBS) of the event that counts conditional branch instructions retired. (Comment: Precise event capable. PS.)
C4H 02H BR_INST_RETIRED.NEAR_CALL: This is a non-precise version (that is, does not use PEBS) of the event that counts both direct and indirect near call instructions retired. (Comment: Precise event capable. PS.)
C4H 08H BR_INST_RETIRED.NEAR_RETURN: This is a non-precise version (that is, does not use PEBS) of the event that counts return instructions retired. (Comment: Precise event capable. PS.)
C4H 10H BR_INST_RETIRED.NOT_TAKEN: This is a non-precise version (that is, does not use PEBS) of the event that counts not-taken branch instructions retired.
C4H 20H BR_INST_RETIRED.NEAR_TAKEN: This is a non-precise version (that is, does not use PEBS) of the event that counts taken branch instructions retired. (Comment: Precise event capable. PS.)
C4H 40H BR_INST_RETIRED.FAR_BRANCH: This is a non-precise version (that is, does not use PEBS) of the event that counts far branch instructions retired. (Comment: Precise event capable. PS.)
C5H 00H BR_MISP_RETIRED.ALL_BRANCHES: Counts all the retired branch instructions that were mispredicted by the processor. A branch misprediction occurs when the processor incorrectly predicts the destination of the branch. When the misprediction is discovered at execution, all the instructions executed in the wrong (speculative) path must be discarded, and the processor must start fetching from the correct path. (Comment: Precise event capable. See Table 19-1.)
C5H 01H BR_MISP_RETIRED.CONDITIONAL: This is a non-precise version (that is, does not use PEBS) of the event that counts mispredicted conditional branch instructions retired. (Comment: Precise event capable. PS.)
C5H 02H BR_MISP_RETIRED.NEAR_CALL: Counts both taken and not-taken retired mispredicted direct and indirect near calls, including both register and memory indirect. (Comment: Precise event capable.)
C5H 20H BR_MISP_RETIRED.NEAR_TAKEN: Number of near branch instructions retired that were mispredicted and taken. (Comment: Precise event capable. PS.)
C6H 01H FRONTEND_RETIRED.DSB_MISS: Counts retired instructions that experienced a DSB (Decode Stream Buffer, i.e., the decoded instruction cache) miss. (Comment: Precise event capable.)
C6H 01H FRONTEND_RETIRED.L1I_MISS: Retired instructions that experienced an instruction L1 cache true miss. (Comment: Precise event capable.)
C6H 01H FRONTEND_RETIRED.L2_MISS: Retired instructions that experienced an instruction L2 cache true miss. (Comment: Precise event capable.)
C6H 01H FRONTEND_RETIRED.ITLB_MISS: Counts retired instructions that experienced an ITLB (Instruction TLB) true miss. (Comment: Precise event capable.)
C6H 01H FRONTEND_RETIRED.STLB_MISS: Counts retired instructions that experienced an STLB (second-level TLB) true miss. (Comment: Precise event capable.)
C6H 01H FRONTEND_RETIRED.LATENCY_GE_2: Retired instructions that are fetched after an interval where the front end delivered no uops for a period of 2 cycles which was not interrupted by a back-end stall. (Comment: Precise event capable.)
C6H 01H FRONTEND_RETIRED.LATENCY_GE_4: Retired instructions that are fetched after an interval where the front end delivered no uops for a period of 4 cycles which was not interrupted by a back-end stall. (Comment: Precise event capable.)
C6H 01H FRONTEND_RETIRED.LATENCY_GE_8: Counts retired instructions that are delivered to the back end after a front-end stall of at least 8 cycles. During this period the front end delivered no uops. (Comment: Precise event capable.)
C6H 01H FRONTEND_RETIRED.LATENCY_GE_16: Counts retired instructions that are delivered to the back end after a front-end stall of at least 16 cycles. During this period the front end delivered no uops. (Comment: Precise event capable.)
C6H 01H FRONTEND_RETIRED.LATENCY_GE_32: Counts retired instructions that are delivered to the back end after a front-end stall of at least 32 cycles. During this period the front end delivered no uops. (Comment: Precise event capable.)
C6H 01H FRONTEND_RETIRED.LATENCY_GE_64: Retired instructions that are fetched after an interval where the front end delivered no uops for a period of 64 cycles which was not interrupted by a back-end stall. (Comment: Precise event capable.)
C6H 01H FRONTEND_RETIRED.LATENCY_GE_128: Retired instructions that are fetched after an interval where the front end delivered no uops for a period of 128 cycles which was not interrupted by a back-end stall. (Comment: Precise event capable.)
C6H 01H FRONTEND_RETIRED.LATENCY_GE_256: Retired instructions that are fetched after an interval where the front end delivered no uops for a period of 256 cycles which was not interrupted by a back-end stall. (Comment: Precise event capable.)
C6H 01H FRONTEND_RETIRED.LATENCY_GE_512: Retired instructions that are fetched after an interval where the front end delivered no uops for a period of 512 cycles which was not interrupted by a back-end stall. (Comment: Precise event capable.)
C6H 01H FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1: Counts retired instructions that are delivered to the back end after the front end had at least 1 bubble slot for a period of 2 cycles. A bubble slot is an empty issue-pipeline slot while there was no RAT stall. (Comment: Precise event capable.)
C6H 01H FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_2: Retired instructions that are fetched after an interval where the front end had at least 2 bubble slots for a period of 2 cycles which was not interrupted by a back-end stall. (Comment: Precise event capable.)
C6H 01H FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_3: Retired instructions that are fetched after an interval where the front end had at least 3 bubble slots for a period of 2 cycles which was not interrupted by a back-end stall. (Comment: Precise event capable.)
C7H 01H FP_ARITH_INST_RETIRED.SCALAR_DOUBLE: Number of SSE/AVX computational scalar double precision floating-point instructions retired. Each count represents 1 computation. Applies to SSE* and AVX* scalar double precision floating-point instructions: ADD, SUB, MUL, DIV, MIN, MAX, SQRT, FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element. (Comment: Software may treat each count as one DP FLOP.)
C7H 02H FP_ARITH_INST_RETIRED.SCALAR_SINGLE: Number of SSE/AVX computational scalar single precision floating-point instructions retired. Each count represents 1 computation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD, SUB, MUL, DIV, MIN, MAX, RCP, RSQRT, SQRT, FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element. (Comment: Software may treat each count as one SP FLOP.)
C7H 04H FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE: Number of SSE/AVX computational 128-bit packed double precision floating-point instructions retired. Each count represents 2 computations. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD, SUB, MUL, DIV, MIN, MAX, SQRT, DPP, FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element. (Comment: Software may treat each count as two DP FLOPs.)
C7H 08H FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE: Number of SSE/AVX computational 128-bit packed single precision floating-point instructions retired. Each count represents 4 computations. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD, SUB, MUL, DIV, MIN, MAX, RCP, RSQRT, SQRT, DPP, FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element. (Comment: Software may treat each count as four SP FLOPs.)
C7H 10H FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE: Number of SSE/AVX computational 256-bit packed double precision floating-point instructions retired. Each count represents 4 computations. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD, SUB, MUL, DIV, MIN, MAX, SQRT, DPP, FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element. (Comment: Software may treat each count as four DP FLOPs.)
C7H 20H FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE: Number of SSE/AVX computational 256-bit packed single precision floating-point instructions retired. Each count represents 8 computations. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD, SUB, MUL, DIV, MIN, MAX, RCP, RSQRT, SQRT, DPP, FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element. (Comment: Software may treat each count as eight SP FLOPs.)
C7H 40H FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE: Number of packed double precision FP arithmetic instructions (use an operation multiplier of 8). (Comment: Only applicable when AVX-512 is enabled.)
C7H 80H FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE: Number of packed single precision FP arithmetic instructions (use an operation multiplier of 16). (Comment: Only applicable when AVX-512 is enabled.)
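The per-count multipliers in the FP_ARITH_INST_RETIRED comments can be combined into a FLOP estimate. A minimal sketch, assuming raw event counts were already collected (the sample values are hypothetical):

```python
# Sketch: estimating total double-precision FLOPs from the
# FP_ARITH_INST_RETIRED event counts, using the per-count multipliers
# stated in the table (1/2/4/8 computations per count).
DP_MULTIPLIERS = {
    "SCALAR_DOUBLE": 1,
    "128B_PACKED_DOUBLE": 2,
    "256B_PACKED_DOUBLE": 4,
    "512B_PACKED_DOUBLE": 8,
}

def dp_flops(counts: dict[str, int]) -> int:
    """Weighted sum of event counts. FMA double-counting is already
    reflected in the raw counts, per the table's note."""
    return sum(DP_MULTIPLIERS[name] * c for name, c in counts.items())

print(dp_flops({"SCALAR_DOUBLE": 100, "256B_PACKED_DOUBLE": 50}))  # 300
```

Dividing such a FLOP total by elapsed seconds gives an achieved FLOPS rate.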
C8H 01H HLE_RETIRED.START: Number of times an HLE region was entered. Does not count nested transactions.
C8H 02H HLE_RETIRED.COMMIT: Number of times an HLE commit succeeded.
C8H 04H HLE_RETIRED.ABORTED: Number of times an HLE abort was triggered. (Comment: Precise event capable.)
C8H 08H HLE_RETIRED.ABORTED_MEM: Number of times an HLE execution aborted due to various memory events (e.g., read/write capacity and conflicts).
C8H 10H HLE_RETIRED.ABORTED_TIMER: Number of times an HLE execution aborted due to hardware timer expiration.
C8H 20H HLE_RETIRED.ABORTED_UNFRIENDLY: Number of times an HLE execution aborted due to HLE-unfriendly instructions and certain unfriendly events (such as AD assists, etc.).
C8H 40H HLE_RETIRED.ABORTED_MEMTYPE: Number of times an HLE execution aborted due to an incompatible memory type.
C8H 80H HLE_RETIRED.ABORTED_EVENTS: Number of times an HLE execution aborted due to unfriendly events (such as interrupts).
C9H 01H RTM_RETIRED.START: Number of times an RTM region was entered. Does not count nested transactions.
C9H 02H RTM_RETIRED.COMMIT: Number of times an RTM commit succeeded.
C9H 04H RTM_RETIRED.ABORTED: Number of times an RTM abort was triggered. (Comment: Precise event capable.)
C9H 08H RTM_RETIRED.ABORTED_MEM: Number of times an RTM execution aborted due to various memory events (e.g., read/write capacity and conflicts).
C9H 10H RTM_RETIRED.ABORTED_TIMER: Number of times an RTM execution aborted due to uncommon conditions.
C9H 20H RTM_RETIRED.ABORTED_UNFRIENDLY: Number of times an RTM execution aborted due to HLE-unfriendly instructions.
C9H 40H RTM_RETIRED.ABORTED_MEMTYPE: Number of times an RTM execution aborted due to an incompatible memory type.
C9H 80H RTM_RETIRED.ABORTED_EVENTS: Number of times an RTM execution aborted due to none of the previous 4 categories (e.g., an interrupt).
CAH 1EH FP_ASSIST.ANY: Counts cycles with any input or output SSE or x87 FP assist. If an input and an output assist are detected on the same cycle, the event increments by 1. (Comment: CounterMask=1; CMSK1.)
CBH 01H HW_INTERRUPTS.RECEIVED: Counts the number of hardware interrupts received by the processor.
CCH 20H ROB_MISC_EVENTS.LBR_INSERTS: Increments when an entry is added to the Last Branch Record (LBR) array (or removed from the array in the case of RETURNs in call-stack mode). The event requires LBR to be enabled via the IA32_DEBUGCTL MSR and branch type selection via MSR_LBR_SELECT.
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4: Counts loads when the latency from first dispatch to completion is greater than 4 cycles. Reported latency may be longer than just the memory latency. (Comment: Precise event capable. Specify the threshold in MSR 3F6H.)
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8: Counts loads when the latency from first dispatch to completion is greater than 8 cycles. Reported latency may be longer than just the memory latency. (Comment: Precise event capable. Specify the threshold in MSR 3F6H.)
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16: Counts loads when the latency from first dispatch to completion is greater than 16 cycles. Reported latency may be longer than just the memory latency. (Comment: Precise event capable. Specify the threshold in MSR 3F6H.)
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32: Counts loads when the latency from first dispatch to completion is greater than 32 cycles. Reported latency may be longer than just the memory latency. (Comment: Precise event capable. Specify the threshold in MSR 3F6H.)
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64: Counts loads when the latency from first dispatch to completion is greater than 64 cycles. Reported latency may be longer than just the memory latency. (Comment: Precise event capable. Specify the threshold in MSR 3F6H.)
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128: Counts loads when the latency from first dispatch to completion is greater than 128 cycles. Reported latency may be longer than just the memory latency. (Comment: Precise event capable. Specify the threshold in MSR 3F6H.)
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256: Counts loads when the latency from first dispatch to completion is greater than 256 cycles. Reported latency may be longer than just the memory latency. (Comment: Precise event capable. Specify the threshold in MSR 3F6H.)
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512: Counts loads when the latency from first dispatch to completion is greater than 512 cycles. Reported latency may be longer than just the memory latency. (Comment: Precise event capable. Specify the threshold in MSR 3F6H.)
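Since the LOAD_LATENCY_GT_* thresholds are cumulative (each counts loads above its threshold), adjacent counts can be differenced into a latency histogram. A sketch under the assumption that the counts were gathered in separate runs or counter groups (only one threshold is programmed in MSR 3F6H at a time); the sample counts are hypothetical:

```python
# Sketch: turn cumulative "loads with latency > t" counts into a histogram
# of latency ranges.
thresholds = [4, 8, 16, 32, 64, 128, 256, 512]
gt_counts = [9000, 4000, 1500, 600, 200, 80, 20, 5]  # hypothetical samples

histogram = {
    f"({thresholds[i]}, {thresholds[i + 1]}]": gt_counts[i] - gt_counts[i + 1]
    for i in range(len(thresholds) - 1)
}
histogram[f">{thresholds[-1]}"] = gt_counts[-1]  # open-ended final bucket
print(histogram)
```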
D0H 11H MEM_INST_RETIRED.STLB_MISS_LOADS: Retired load instructions that miss the STLB. (Comment: Precise event capable. PSDLA.)
D0H 12H MEM_INST_RETIRED.STLB_MISS_STORES: Retired store instructions that miss the STLB. (Comment: Precise event capable. PSDLA.)
D0H 21H MEM_INST_RETIRED.LOCK_LOADS: Retired load instructions with locked access. (Comment: Precise event capable. PSDLA.)
D0H 41H MEM_INST_RETIRED.SPLIT_LOADS: Counts retired load instructions that split across a cacheline boundary. (Comment: Precise event capable. PSDLA.)
D0H 42H MEM_INST_RETIRED.SPLIT_STORES: Counts retired store instructions that split across a cacheline boundary. (Comment: Precise event capable. PSDLA.)
D0H 81H MEM_INST_RETIRED.ALL_LOADS: All retired load instructions. (Comment: Precise event capable. PSDLA.)
D0H 82H MEM_INST_RETIRED.ALL_STORES: All retired store instructions. (Comment: Precise event capable. PSDLA.)
D1H 01H MEM_LOAD_RETIRED.L1_HIT: Counts retired load instructions with at least one uop that hit in the L1 data cache. This event includes all SW prefetches and lock instructions regardless of the data source. (Comment: Precise event capable. PSDLA.)
D1H 02H MEM_LOAD_RETIRED.L2_HIT: Retired load instructions with L2 cache hits as data sources. (Comment: Precise event capable. PSDLA.)
D1H 04H MEM_LOAD_RETIRED.L3_HIT: Counts retired load instructions with at least one uop that hit in the L3 cache. (Comment: Precise event capable. PSDLA.)
D1H 08H MEM_LOAD_RETIRED.L1_MISS: Counts retired load instructions with at least one uop that missed in the L1 cache. (Comment: Precise event capable. PSDLA.)
D1H 10H MEM_LOAD_RETIRED.L2_MISS: Retired load instructions that missed the L2 cache as data sources. (Comment: Precise event capable. PSDLA.)
D1H 20H MEM_LOAD_RETIRED.L3_MISS: Counts retired load instructions with at least one uop that missed in the L3 cache. (Comment: Precise event capable. PSDLA.)
D1H 40H MEM_LOAD_RETIRED.FB_HIT: Counts retired load instructions with at least one uop that missed in the L1 but hit the FB (fill buffer) due to a preceding miss to the same cache line with the data not ready. (Comment: Precise event capable. PSDLA.)
D2H 01H MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS: Retired load instructions whose data sources were L3 hits and cross-core snoops missed in an on-pkg core cache. (Comment: Precise event capable. PSDLA.)
D2H 02H MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT: Retired load instructions whose data sources were L3 and cross-core snoop hits in an on-pkg core cache. (Comment: Precise event capable. PSDLA.)
D2H 04H MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM: Retired load instructions whose data sources were HitM responses from a shared L3. (Comment: Precise event capable. PSDLA.)
D2H 08H MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE: Retired load instructions whose data sources were hits in L3 without snoops required. (Comment: Precise event capable. PSDLA.)
D3H 01H MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM: Retired load instructions whose data sources missed L3 but were serviced from local DRAM. (Comment: Precise event capable.)
D3H 02H MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM: Retired load instructions whose data sources missed L3 but were serviced from remote DRAM. (Comment: Precise event capable.)
D3H 04H MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM: Retired load instructions whose data source was a remote HITM. (Comment: Precise event capable.)
D3H 08H MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD: Retired load instructions whose data source was forwarded from a remote cache.
D4H 04H MEM_LOAD_MISC_RETIRED.UC: Retired instructions with at least one uncacheable load or lock. (Comment: Precise event capable.)
E6H 01H BACLEARS.ANY: Counts the number of times the front end is re-steered when it finds a branch instruction in a fetch line. This occurs the first time a branch instruction is fetched, or when the branch is no longer tracked by the BPU (Branch Prediction Unit).
F0H 40H L2_TRANS.L2_WB: Counts L2 writebacks that access the L2 cache.
F1H 1FH L2_LINES_IN.ALL: Counts the number of L2 cache lines filling the L2. Counting does not cover rejects.
F2H 01H L2_LINES_OUT.SILENT: Counts the number of lines that are silently dropped by the L2 cache when triggered by an L2 cache fill. These lines are typically in Shared state. A non-threaded event.
F2H 02H L2_LINES_OUT.NON_SILENT: Counts the number of lines that are evicted by the L2 cache when triggered by an L2 cache fill. Those lines can be either in modified state or clean state. Modified lines may either be written back to L3 or directly written to memory and not allocated in L3. Clean lines may either be allocated in L3 or dropped.
F2H 04H L2_LINES_OUT.USELESS_PREF: Counts the number of lines that have been hardware prefetched but not used and are now evicted by the L2 cache.
F2H 04H L2_LINES_OUT.USELESS_HWPF: Counts the number of lines that have been hardware prefetched but not used and are now evicted by the L2 cache.
F4H 10H SQ_MISC.SPLIT_LOCK: Counts the number of cache line split locks sent to the uncore.
FEH 02H IDI_MISC.WB_UPGRADE: Counts the number of cache lines that are allocated and written back to L3 with the intention that they are more likely to be reused shortly.
FEH 04H IDI_MISC.WB_DOWNGRADE: Counts the number of cache lines that are dropped and not written back to L3 as they are deemed less likely to be reused shortly.
Table 19-4 lists performance events applicable to processors based on the Cascade Lake product.
Table 19-4. Performance Event Addendum in Processors Based on the Cascade Lake Product
Columns: Event Num., Umask Value, Event Mask Mnemonic, Description, Comment.
D1H 80H MEM_LOAD_RETIRED.LOCAL_PMM: Retired load instructions with local Intel® Optane™ DC persistent memory as the data source, where the data request missed L3 (AppDirect or Memory Mode) and the DRAM cache (Memory Mode). (Comment: Precise event capable.)
D3H 10H MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM: Retired load instructions with remote Intel® Optane™ DC persistent memory as the data source, where the data request missed L3 (AppDirect or Memory Mode) and the DRAM cache (Memory Mode). (Comment: Precise event capable.)
Table 19-5. Performance Events of the Processor Core Supported by Ice Lake Microarchitecture
Columns: Event Num., Umask Value, Event Mask Mnemonic, Description, Comment.
00H 01H INST_RETIRED.ANY: Counts the number of x86 instructions retired; an architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter, freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.
00H 01H INST_RETIRED.PREC_DIST: A version of INST_RETIRED that allows for a more unbiased distribution of samples across retired instructions. It utilizes the Precise Distribution of Instructions Retired (PDIR) feature to mitigate some bias in how retired instructions get sampled. Use on Fixed Counter 0.
00H 02H CPU_CLK_UNHALTED.THREAD: Counts the number of core cycles while the thread is not in a halt state. The thread enters the halt state when it is running the HLT instruction. This event is a component in many key event ratios. The core frequency may change from time to time due to transitions associated with Enhanced Intel SpeedStep Technology or TM2; for this reason, this event may have a changing ratio with regard to time. When the core frequency is constant, this event can approximate elapsed time while the core was not in the halt state. It is counted on a dedicated fixed counter, leaving the four (eight when Hyper-Threading is disabled) programmable counters available for other events.
00H 03H CPU_CLK_UNHALTED.REF_TSC: Counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P-states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK event. It is counted on a dedicated fixed counter, leaving the four (eight when Hyper-Threading is disabled) programmable counters available for other events. Note: on all current platforms this event stops counting during 'throttling (TM)' states; during duty-off periods the processor is 'halted'. Because the counter update is done at a lower clock rate than the core clock, the overflow status bit for this counter may appear 'sticky': after the counter has overflowed, software clears the overflow status bit and resets the counter to less than MAX, but the reset value is not clocked into the counter immediately, so the overflow status bit will flip 'high' (1) and generate another PMI (if enabled), after which the reset value gets clocked into the counter. Therefore, software will get the interrupt and read the overflow status bit as '1' for bit 34 while the counter value is less than MAX. Software should ignore this case.
00H 04H TOPDOWN.SLOTS: Counts the number of available slots for an unhalted logical processor. The event increments by the machine width of the narrowest pipeline, as employed by the Top-down Microarchitecture Analysis method. The count is distributed among unhalted logical processors (hyper-threads) that share the same physical core. Software can use this event as the denominator for the top-level metrics of the Top-down Microarchitecture Analysis method. This event is counted on a designated fixed counter (Fixed Counter 3) and is an architectural event.
03H 02H LD_BLOCKS.STORE_FORWARD: Counts the number of times the load operation got the true Block-on-Store blocking code preventing store forwarding. This includes cases when: a. a preceding store conflicts with the load (incomplete overlap), b. store forwarding is impossible due to u-arch limitations, c. preceding lock RMW operations are not forwarded, d. the store has the no-forward bit set (uncacheable/page-split/masked stores), e. all-blocking stores are used (mostly fences and port I/O), and others. The most common case is a load blocked due to its address range overlapping with a preceding smaller uncompleted store. Note: this event does not take into account cases of out-of-SW-control (for example, SbTailHit), unknown physical STA, and cases of blocking loads on a store due to being of non-WB memory type or a lock. These cases are covered by other events. See the table of unsupported store forwards in the Optimization Guide.
03H 08H LD_BLOCKS.NO_SR: Counts the number of times that split load operations are temporarily blocked because all resources for handling the split accesses are in use.
07H 01H LD_BLOCKS_PARTIAL.ADDRESS_ALIAS: Counts the number of times a load got blocked due to false dependencies in the MOB due to a partial compare on the address.
08H 02H DTLB_LOAD_MISSES.WALK_COMPLETED_4K: Counts page walks completed due to demand data loads whose address translations missed in the TLB and were mapped to 4K pages. The page walks can end with or without a page fault.
08H 04H DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M: Counts page walks completed due to demand data loads whose address translations missed in the TLB and were mapped to 2M/4M pages. The page walks can end with or without a page fault.
08H 0EH DTLB_LOAD_MISSES.WALK_COMPLETED: Counts demand data loads that caused a completed page walk of any page size (4K/2M/4M/1G). This implies it missed in all TLB levels. The page walk can end with or without a fault.
08H 10H DTLB_LOAD_MISSES.WALK_PENDING: Counts the number of page walks outstanding for a demand load in the PMH (Page Miss Handler) each cycle.
08H 10H DTLB_LOAD_MISSES.WALK_ACTIVE: Counts cycles when at least one PMH (Page Miss Handler) is busy with a page walk for a demand load.
08H 20H DTLB_LOAD_MISSES.STLB_HIT: Counts loads that miss the DTLB (Data TLB) and hit the STLB (second-level TLB).
0DH 01H INT_MISC.RECOVERY_CYCLES: Counts core cycles when the resource allocator was stalled due to recovery from an earlier branch misprediction or machine clear event.
0DH 03H INT_MISC.ALL_RECOVERY_CYCLES: Counts cycles the back-end cluster is recovering after a misspeculation or a store buffer or load buffer drain stall.
0DH 80H INT_MISC.CLEAR_RESTEER_CYCLES: Cycles after recovery from a branch misprediction or machine clear until the first uop is issued from the resteered path.
0EH 01H UOPS_ISSUED.ANY: Counts the number of uops that the Resource Allocation Table (RAT) issues to the Reservation Station (RS).
0EH 01H UOPS_ISSUED.STALL_CYCLES: Counts cycles during which the Resource Allocation Table (RAT) does not issue any uops to the Reservation Station (RS) for the current thread.
14H 09H ARITH.DIVIDER_ACTIVE: Counts cycles when the divide unit is busy executing divide or square root operations. Accounts for integer and floating-point operations.
24H 21H L2_RQSTS.DEMAND_DATA_RD_MISS: Counts the number of demand Data Read requests that miss the L2 cache. Only non-rejected loads are counted.
24H 22H L2_RQSTS.RFO_MISS: Counts the RFO (Read-for-Ownership) requests that miss the L2 cache.
24H 24H L2_RQSTS.CODE_RD_MISS: Counts L2 cache misses when fetching instructions.
24H 27H L2_RQSTS.ALL_DEMAND_MISS: Counts demand requests that miss the L2 cache.
24H 28H L2_RQSTS.SWPF_MISS: Counts software prefetch requests that miss the L2 cache. This event accounts for the PREFETCHNTA and PREFETCHT0/1/2 instructions.
24H C1H L2_RQSTS.DEMAND_DATA_RD_HIT: Counts the number of demand Data Read requests initiated by load instructions that hit the L2 cache.
24H C2H L2_RQSTS.RFO_HIT: Counts the RFO (Read-for-Ownership) requests that hit the L2 cache.
24H C4H L2_RQSTS.CODE_RD_HIT: Counts L2 cache hits when fetching instructions (code reads).
24H C8H L2_RQSTS.SWPF_HIT: Counts software prefetch requests that hit the L2 cache. This event accounts for the PREFETCHNTA and PREFETCHT0/1/2 instructions.
24H E1H L2_RQSTS.ALL_DEMAND_DATA_RD: Counts the number of demand Data Read requests (including requests from the L1D hardware prefetchers). These loads may hit or miss the L2 cache. Only non-rejected loads are counted.
24H E2H L2_RQSTS.ALL_RFO: Counts the total number of RFO (read for ownership) requests to the L2 cache. L2 RFO requests include both L1D demand RFO misses and L1D RFO prefetches.
24H E4H L2_RQSTS.ALL_CODE_RD: Counts the total number of L2 code requests.
24H E7H L2_RQSTS.ALL_DEMAND_REFERENCES: Counts demand requests to the L2 cache.
28H 07H CORE_POWER.LVL0_TURBO_LICENSE: Counts core cycles where the core was running with power delivery for baseline license level 0. This includes non-AVX code, SSE, AVX 128-bit, and low-current AVX 256-bit code.
28H 18H CORE_POWER.LVL1_TURBO_LICENSE: Counts core cycles where the core was running with power delivery for license level 1. This includes high-current AVX 256-bit instructions as well as low-current AVX 512-bit instructions.
28H 20H CORE_POWER.LVL2_TURBO_LICENSE Core cycles where the core was running with power-delivery for license level 2 (introduced in Skylake Server microarchitecture). This includes high-current AVX 512-bit instructions.
32H 01H SW_PREFETCH_ACCESS.NTA Counts the number of PREFETCHNTA instructions executed.
32H 02H SW_PREFETCH_ACCESS.T0 Counts the number of PREFETCHT0 instructions executed.
32H 04H SW_PREFETCH_ACCESS.T1_T2 Counts the number of PREFETCHT1 or PREFETCHT2 instructions executed.
32H 08H SW_PREFETCH_ACCESS.PREFETCHW Counts the number of PREFETCHW instructions executed.
3CH 00H CPU_CLK_UNHALTED.THREAD_P This is an architectural event that counts the number of thread cycles while the thread is not in a halt state. The thread enters the halt state when it is running the HLT instruction. The core frequency may change from time to time due to power or thermal throttling; for this reason, this event may have a changing ratio with regard to wall clock time.
3CH 01H CPU_CLK_UNHALTED.REF_XCLK Counts core crystal clock cycles when the thread is unhalted.
3CH 02H CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE Counts core crystal clock cycles when the current thread is unhalted and the other thread is halted.
48H 01H L1D_PEND_MISS.PENDING Counts the number of L1D misses that are outstanding in each cycle; that is, each cycle, the number of Fill Buffers (FB) outstanding required by Demand Reads. An FB is either held by demand loads, or held by non-demand loads and hit at least once by demand. The valid outstanding interval lasts until FB deallocation: from FB allocation, if the FB is allocated by demand; from the demand hit to the FB, if it is allocated by hardware or software prefetch. Note: In the L1D, a Demand Read contains cacheable or noncacheable demand loads, including ones causing cache-line splits and reads due to page walks resulting from any request type.
48H 01H L1D_PEND_MISS.PENDING_CYCLES Counts duration of L1D miss outstanding in cycles.
48H 02H L1D_PEND_MISS.FB_FULL Counts the number of cycles a demand request has waited due to L1D Fill Buffer (FB) unavailability. Demand requests include cacheable/uncacheable demand load, store, lock, or SW prefetch accesses.
48H 02H L1D_PEND_MISS.FB_FULL_PERIODS Counts the number of phases a demand request has waited due to L1D Fill Buffer (FB) unavailability. Demand requests include cacheable/uncacheable demand load, store, lock, or SW prefetch accesses.
48H 04H L1D_PEND_MISS.L2_STALL Counts the number of cycles a demand request has waited in the L1D due to lack of L2 resources. Demand requests include cacheable/uncacheable demand load, store, lock, or SW prefetch accesses.
49H 02H DTLB_STORE_MISSES.WALK_COMPLETED_4K Counts page walks completed due to demand data stores whose address translations missed in the TLB and were mapped to 4K pages. The page walks can end with or without a page fault.
49H 04H DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M Counts page walks completed due to demand data stores whose address translations missed in the TLB and were mapped to 2M/4M pages. The page walks can end with or without a page fault.
49H 0EH DTLB_STORE_MISSES.WALK_COMPLETED Counts demand data stores that caused a completed page walk of any page size (4K/2M/4M/1G). This implies it missed in all TLB levels. The page walk can end with or without a fault.
49H 10H DTLB_STORE_MISSES.WALK_PENDING Counts the number of page walks outstanding for a store in the PMH (Page Miss Handler) each cycle.
49H 10H DTLB_STORE_MISSES.WALK_ACTIVE Counts cycles when at least one PMH (Page Miss Handler) is busy with a page walk for a store.
49H 20H DTLB_STORE_MISSES.STLB_HIT Counts stores that miss the DTLB (Data TLB) and hit the STLB (2nd-level TLB).
4CH 01H LOAD_HIT_PREFETCH.SWPF Counts load dispatches that are not software prefetches but hit the fill buffer (FB) allocated for a software prefetch. It can also be incremented by some lock instructions, so it should only be used with profiling, so that the locks can be excluded by inspecting the nearby instructions in the assembly file.
51H 01H L1D.REPLACEMENT Counts L1D data line replacements, including opportunistic replacements and replacements that require stall-for-replace or block-for-replace.
54H 01H TX_MEM.ABORT_CONFLICT Counts the number of times a TSX line had a cache conflict.
54H 02H TX_MEM.ABORT_CAPACITY_WRITE Speculatively counts the number of Transactional Synchronization Extensions (TSX) aborts due to a data capacity limitation for transactional writes.
54H 04H TX_MEM.ABORT_HLE_STORE_TO_ELIDED_LOCK Counts the number of times a TSX abort was triggered due to a non-release/commit store to lock.
54H 08H TX_MEM.ABORT_HLE_ELISION_BUFFER_NOT_EMPTY Counts the number of times a TSX abort was triggered due to commit but Lock Buffer not empty.
54H 10H TX_MEM.ABORT_HLE_ELISION_BUFFER_MISMATCH Counts the number of times a TSX abort was triggered due to release/commit but data and address mismatch.
54H 20H TX_MEM.ABORT_HLE_ELISION_BUFFER_UNSUPPORTED_ALIGNMENT Counts the number of times a TSX abort was triggered due to attempting an unsupported alignment from the Lock Buffer.
54H 40H TX_MEM.HLE_ELISION_BUFFER_FULL Counts the number of times the Lock Buffer could not be allocated.
5DH 02H TX_EXEC.MISC2 Counts unfriendly TSX aborts triggered by a vzeroupper instruction.
5DH 04H TX_EXEC.MISC3 Counts unfriendly TSX aborts triggered by a nest count that is too deep.
5EH 01H RS_EVENTS.EMPTY_CYCLES Counts cycles during which the Reservation Station (RS) is empty for this logical processor. This is usually caused when the front-end pipeline runs into starvation periods (e.g., branch mispredictions or i-cache misses).
5EH 01H RS_EVENTS.EMPTY_END Counts ends of periods where the Reservation Station (RS) was empty. Could be useful to closely sample front-end latency issues (see the FRONTEND_RETIRED event of designated precise events).
60H 04H OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO Counts the number of offcore outstanding demand RFO transactions in the super queue every cycle. The 'offcore outstanding' state of the transaction lasts from the L2 miss until the transaction completion is sent to the requestor (SQ deallocation). See the corresponding umask under OFFCORE_REQUESTS.
60H 08H OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD Counts the number of offcore outstanding cacheable Core Data Read transactions in the super queue every cycle. A transaction is considered to be in the offcore outstanding state between the L2 miss and the transaction completion sent to the requestor (SQ deallocation). See the corresponding umask under OFFCORE_REQUESTS.
60H 08H OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD Counts cycles when offcore outstanding cacheable Core Data Read transactions are present in the super queue. A transaction is considered to be in the offcore outstanding state between the L2 miss and the transaction completion sent to the requestor (SQ deallocation). See the corresponding umask under OFFCORE_REQUESTS.
79H 04H IDQ.MITE_UOPS Counts the number of uops delivered to the Instruction Decode Queue (IDQ) from the MITE path. This also means that uops are not being delivered from the Decode Stream Buffer (DSB).
79H 04H IDQ.MITE_CYCLES_OK Counts the number of cycles where the optimal number of uops was delivered to the Instruction Decode Queue (IDQ) from the MITE (legacy decode pipeline) path. During these cycles uops are not being delivered from the Decode Stream Buffer (DSB).
79H 04H IDQ.MITE_CYCLES_ANY Counts the number of cycles uops were delivered to the Instruction Decode Queue (IDQ) from the MITE (legacy decode pipeline) path. During these cycles uops are not being delivered from the Decode Stream Buffer (DSB).
79H 08H IDQ.DSB_UOPS Counts the number of uops delivered to the Instruction Decode Queue (IDQ) from the Decode Stream Buffer (DSB) path.
79H 08H IDQ.DSB_CYCLES_OK Counts the number of cycles where the optimal number of uops was delivered to the Instruction Decode Queue (IDQ) from the Decode Stream Buffer (DSB) path.
79H 08H IDQ.DSB_CYCLES_ANY Counts the number of cycles uops were delivered to the Instruction Decode Queue (IDQ) from the Decode Stream Buffer (DSB) path.
79H 30H IDQ.MS_SWITCHES Counts the number of switches from the DSB (Decode Stream Buffer) or MITE (legacy decode pipeline) to the Microcode Sequencer.
79H 30H IDQ.MS_UOPS Counts the total number of uops delivered by the Microcode Sequencer (MS). Any instruction over 4 uops will be delivered by the MS. Some instructions, such as transcendentals, may additionally generate uops from the MS.
79H 30H IDQ.MS_CYCLES_ANY Counts cycles during which uops are being delivered to the Instruction Decode Queue (IDQ) while the Microcode Sequencer (MS) is busy. Uops may be initiated by the Decode Stream Buffer (DSB) or MITE.
80H 04H ICACHE_16B.IFDATA_STALL Counts cycles where a code line fetch is stalled due to an L1 instruction cache miss. The legacy decode pipeline works at a 16-byte granularity.
83H 01H ICACHE_64B.IFTAG_HIT Counts instruction fetch tag lookups that hit in the instruction cache (L1I). Counts at 64-byte cache-line granularity. Accounts for both cacheable and uncacheable accesses.
83H 02H ICACHE_64B.IFTAG_MISS Counts instruction fetch tag lookups that miss in the instruction cache (L1I). Counts at 64-byte cache-line granularity. Accounts for both cacheable and uncacheable accesses.
83H 04H ICACHE_64B.IFTAG_STALL Counts cycles where a code fetch is stalled due to an L1 instruction cache tag miss.
85H 02H ITLB_MISSES.WALK_COMPLETED_4K Counts completed page walks (4K page size) caused by a code fetch. This implies it missed in the ITLB and further levels of TLB. The page walk can end with or without a fault.
85H 04H ITLB_MISSES.WALK_COMPLETED_2M_4M Counts code misses in all ITLB (Instruction TLB) levels that caused a completed page walk (2M and 4M page sizes). The page walk can end with or without a fault.
85H 0EH ITLB_MISSES.WALK_COMPLETED Counts completed page walks of any page size caused by a code fetch. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB. The page walk can end with or without a fault.
85H 10H ITLB_MISSES.WALK_PENDING Counts the number of page walks outstanding for an outstanding code (instruction fetch) request in the PMH (Page Miss Handler) each cycle.
85H 10H ITLB_MISSES.WALK_ACTIVE Counts cycles when at least one PMH (Page Miss Handler) is busy with a page walk for a code (instruction fetch) request.
85H 20H ITLB_MISSES.STLB_HIT Counts instruction fetch requests that miss the ITLB (Instruction TLB) and hit the STLB (Second-level TLB).
87H 01H ILD_STALL.LCP Counts cycles during which Instruction Length Decoder (ILD) stalls occurred due to a dynamically changing prefix length of the decoded instruction (operand size prefix 66H, address size prefix 67H, or REX.W for Intel 64). The count is proportional to the number of prefixes in a 16-byte line. This may result in a three-cycle penalty for each LCP (Length Changing Prefix) in a 16-byte chunk.
9CH 01H IDQ_UOPS_NOT_DELIVERED.CORE Counts the number of uops not delivered by the Instruction Decode Queue (IDQ) to the back-end of the pipeline when there were no back-end stalls. This event counts for one SMT thread in a given cycle.
9CH 01H IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE Counts the number of cycles when no uops were delivered by the Instruction Decode Queue (IDQ) to the back-end of the pipeline when there were no back-end stalls. This event counts for one SMT thread in a given cycle.
9CH 01H IDQ_UOPS_NOT_DELIVERED.CYCLES_FE_WAS_OK Counts the number of cycles when the optimal number of uops was delivered by the Instruction Decode Queue (IDQ) to the back-end of the pipeline when there were no back-end stalls. This event counts for one SMT thread in a given cycle.
A1H 01H UOPS_DISPATCHED.PORT_0 Counts, on a per-thread basis, cycles during which at least one uop is dispatched from the Reservation Station (RS) to port 0.
A1H 02H UOPS_DISPATCHED.PORT_1 Counts, on a per-thread basis, cycles during which at least one uop is dispatched from the Reservation Station (RS) to port 1.
A1H 04H UOPS_DISPATCHED.PORT_2_3 Counts, on a per-thread basis, cycles during which at least one uop is dispatched from the Reservation Station (RS) to ports 2 and 3.
A1H 10H UOPS_DISPATCHED.PORT_4_9 Counts, on a per-thread basis, cycles during which at least one uop is dispatched from the Reservation Station (RS) to ports 4 and 9.
A1H 20H UOPS_DISPATCHED.PORT_5 Counts, on a per-thread basis, cycles during which at least one uop is dispatched from the Reservation Station (RS) to port 5.
A1H 40H UOPS_DISPATCHED.PORT_6 Counts, on a per-thread basis, cycles during which at least one uop is dispatched from the Reservation Station (RS) to port 6.
A1H 80H UOPS_DISPATCHED.PORT_7_8 Counts, on a per-thread basis, cycles during which at least one uop is dispatched from the Reservation Station (RS) to ports 7 and 8.
A2H 02H RESOURCE_STALLS.SCOREBOARD Counts cycles where the pipeline is stalled due to serializing operations.
A2H 08H RESOURCE_STALLS.SB Counts allocation stall cycles caused by the store buffer (SB) being full. This counts cycles during which the pipeline back-end blocked uop delivery from the front-end.
A3H 01H CYCLE_ACTIVITY.CYCLES_L2_MISS Cycles while an L2 cache miss demand load is outstanding.
A3H 02H CYCLE_ACTIVITY.CYCLES_L3_MISS Cycles while an L3 cache miss demand load is outstanding.
A3H 04H CYCLE_ACTIVITY.STALLS_TOTAL Total execution stalls.
A3H 05H CYCLE_ACTIVITY.STALLS_L2_MISS Execution stalls while an L2 cache miss demand load is outstanding.
A3H 06H CYCLE_ACTIVITY.STALLS_L3_MISS Execution stalls while an L3 cache miss demand load is outstanding.
A3H 08H CYCLE_ACTIVITY.CYCLES_L1D_MISS Cycles while an L1 cache miss demand load is outstanding.
A3H 0CH CYCLE_ACTIVITY.STALLS_L1D_MISS Execution stalls while an L1 cache miss demand load is outstanding.
A3H 10H CYCLE_ACTIVITY.CYCLES_MEM_ANY Cycles while the memory subsystem has an outstanding load.
A3H 14H CYCLE_ACTIVITY.STALLS_MEM_ANY Execution stalls while the memory subsystem has an outstanding load.
A4H 01H TOPDOWN.SLOTS_P Counts the number of available slots for an unhalted logical processor. The event increments by the machine width of the narrowest pipeline as employed by the Top-down Microarchitecture Analysis method. The count is distributed among the unhalted logical processors (hyper-threads) that share the same physical core.
A4H 02H TOPDOWN.BACKEND_BOUND_SLOTS Issue slots where no uops were being issued due to lack of back-end resources.
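The two TOPDOWN events above feed the level-1 Top-down Microarchitecture Analysis metric directly: the Backend Bound fraction is the ratio of back-end-bound slots to total slots. A minimal sketch of that ratio follows; the counter values passed in are hypothetical sample readings, not measured data.

```python
def backend_bound_fraction(backend_bound_slots: int, total_slots: int) -> float:
    """Top-down level-1 metric: share of issue slots lost to back-end stalls.

    backend_bound_slots: reading of TOPDOWN.BACKEND_BOUND_SLOTS (event A4H, umask 02H).
    total_slots:         reading of TOPDOWN.SLOTS_P (event A4H, umask 01H).
    """
    if total_slots == 0:
        raise ValueError("TOPDOWN.SLOTS_P reading is zero")
    return backend_bound_slots / total_slots

# Hypothetical counter readings from one measurement interval.
print(backend_bound_fraction(backend_bound_slots=3_200_000, total_slots=8_000_000))  # → 0.4
```

The remaining level-1 categories (Frontend Bound, Bad Speculation, Retiring) are derived similarly from other events and together account for all slots.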
A6H 02H EXE_ACTIVITY.1_PORTS_UTIL Counts cycles during which a total of 1 uop was executed on all ports and the Reservation Station (RS) was not empty.
A6H 04H EXE_ACTIVITY.2_PORTS_UTIL Counts cycles during which a total of 2 uops were executed on all ports and the Reservation Station (RS) was not empty.
A6H 40H EXE_ACTIVITY.BOUND_ON_STORES Counts cycles where the Store Buffer was full and no loads caused an execution stall.
A6H 80H EXE_ACTIVITY.EXE_BOUND_0_PORTS Counts cycles during which no uops were executed on any port and the Reservation Station (RS) was not empty.
A8H 01H LSD.UOPS Counts the number of uops delivered to the back-end by the LSD (Loop Stream Detector).
A8H 01H LSD.CYCLES_ACTIVE Counts the cycles when at least one uop is delivered by the LSD (Loop Stream Detector).
A8H 01H LSD.CYCLES_OK Counts the cycles when the optimal number of uops is delivered by the LSD (Loop Stream Detector).
ABH 02H DSB2MITE_SWITCHES.PENALTY_CYCLES The Decode Stream Buffer (DSB) is a uop cache that holds translations of previously fetched instructions that were decoded by the legacy x86 decode pipeline (MITE). This event counts fetch penalty cycles when a transition occurs from the DSB to MITE.
AEH 01H ITLB.ITLB_FLUSH Counts the number of flushes of the big or small ITLB pages. Counting includes both TLB Flush (covering all sets) and TLB Set Clear (set-specific).
B0H 01H OFFCORE_REQUESTS.DEMAND_DATA_RD Counts the Demand Data Read requests sent to the uncore. Use it in conjunction with OFFCORE_REQUESTS_OUTSTANDING to determine average latency in the uncore.
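The average-latency technique mentioned above is a standard occupancy/throughput ratio: divide the occupancy accumulated by an OFFCORE_REQUESTS_OUTSTANDING event by the matching OFFCORE_REQUESTS count. A minimal sketch using the ALL_DATA_RD pair (the pairing and the readings shown are illustrative assumptions, not prescribed by this table):

```python
def avg_offcore_latency(outstanding_cycles: int, requests: int) -> float:
    """Average offcore (uncore) latency in core cycles per data-read request.

    outstanding_cycles: OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD (event 60H, umask 08H),
                        which accumulates super-queue occupancy each cycle.
    requests:           OFFCORE_REQUESTS.ALL_DATA_RD (event B0H, umask 08H).
    """
    if requests == 0:
        raise ValueError("no offcore data-read requests in the interval")
    return outstanding_cycles / requests

# Hypothetical readings: 60,000,000 occupancy-cycles over 500,000 requests.
print(avg_offcore_latency(60_000_000, 500_000))  # → 120.0
```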
B0H 04H OFFCORE_REQUESTS.DEMAND_RFO Counts the demand RFO (read for ownership) requests, including regular RFOs, locks, and ItoM.
B0H 08H OFFCORE_REQUESTS.ALL_DATA_RD Counts the demand and prefetch data reads. All Core Data Reads include cacheable 'Demands' and L2 prefetchers (not L3 prefetchers). Counting also covers reads due to page walks resulting from any request type.
B0H 10H OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD Counts Demand Data Read requests that miss the L3 cache.
B0H 80H OFFCORE_REQUESTS.ALL_REQUESTS Counts memory transactions that reached the super queue, including requests initiated by the core, all L3 prefetches, page walks, etc.
B1H 01H UOPS_EXECUTED.THREAD Counts the number of uops to be executed per thread each cycle.
B1H 01H UOPS_EXECUTED.STALL_CYCLES Counts cycles during which no uops were dispatched from the Reservation Station (RS) per thread.
B1H 01H UOPS_EXECUTED.CYCLES_GE_1 Cycles where at least 1 uop was executed per thread.
B1H 01H UOPS_EXECUTED.CYCLES_GE_2 Cycles where at least 2 uops were executed per thread.
B1H 01H UOPS_EXECUTED.CYCLES_GE_3 Cycles where at least 3 uops were executed per thread.
B1H 01H UOPS_EXECUTED.CYCLES_GE_4 Cycles where at least 4 uops were executed per thread.
B1H 02H UOPS_EXECUTED.CORE Counts the number of uops executed from any thread.
B1H 02H UOPS_EXECUTED.CORE_CYCLES_GE_1 Counts cycles when at least 1 micro-op is executed from any thread on the physical core.
B1H 02H UOPS_EXECUTED.CORE_CYCLES_GE_2 Counts cycles when at least 2 micro-ops are executed from any thread on the physical core.
B1H 02H UOPS_EXECUTED.CORE_CYCLES_GE_3 Counts cycles when at least 3 micro-ops are executed from any thread on the physical core.
B1H 02H UOPS_EXECUTED.CORE_CYCLES_GE_4 Counts cycles when at least 4 micro-ops are executed from any thread on the physical core.
B1H 10H UOPS_EXECUTED.X87 Counts the number of x87 uops executed.
BDH 01H TLB_FLUSH.DTLB_THREAD Counts the number of DTLB flush attempts of the thread-specific entries.
BDH 20H TLB_FLUSH.STLB_ANY Counts the number of any STLB flush attempts (such as entire, VPID, PCID, InvPage, CR3 write, etc.).
C0H 00H INST_RETIRED.ANY_P Counts the number of x86 instructions retired; an architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter, freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.
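INST_RETIRED.ANY_P together with CPU_CLK_UNHALTED.THREAD_P (event 3CH, umask 00H, listed earlier in this table) yields the common instructions-per-cycle (IPC) metric. A minimal sketch; the readings shown are hypothetical:

```python
def ipc(instructions_retired: int, unhalted_cycles: int) -> float:
    """Instructions per cycle for one logical processor.

    instructions_retired: INST_RETIRED.ANY_P        (event C0H, umask 00H).
    unhalted_cycles:      CPU_CLK_UNHALTED.THREAD_P (event 3CH, umask 00H).
    """
    if unhalted_cycles == 0:
        raise ValueError("no unhalted cycles in the interval")
    return instructions_retired / unhalted_cycles

# Hypothetical readings over one sampling interval.
print(ipc(instructions_retired=9_000_000, unhalted_cycles=4_000_000))  # → 2.25
```

Because THREAD_P counts only unhalted cycles at a possibly varying frequency, IPC computed this way is not directly comparable to wall-clock throughput, as the THREAD_P description notes.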
C1H 02H ASSISTS.FP Counts all microcode Floating Point assists.
C1H 07H ASSISTS.ANY Counts the number of occurrences where a microcode assist is invoked by hardware. Examples include AD (page Access Dirty), FP, and AVX related assists.
C2H 02H UOPS_RETIRED.TOTAL_CYCLES Counts the number of cycles using an always-true condition (uops_ret < 16) applied to the non-PEBS uops retired event.
C2H 02H UOPS_RETIRED.SLOTS Counts the retirement slots used each cycle.
C3H 01H MACHINE_CLEARS.COUNT Counts the number of machine clears (nukes) of any type.
C3H 02H MACHINE_CLEARS.MEMORY_ORDERING Counts the number of Machine Clears detected due to memory ordering. Memory Ordering Machine Clears may apply when a memory read may not conform to the memory ordering rules of the x86 architecture.
C3H 04H MACHINE_CLEARS.SMC Counts self-modifying code (SMC) detected, which causes a machine clear.
C4H 00H BR_INST_RETIRED.ALL_BRANCHES Counts all branch instructions retired.
C4H 01H BR_INST_RETIRED.COND_TAKEN Counts taken conditional branch instructions retired.
C4H 02H BR_INST_RETIRED.NEAR_CALL Counts both direct and indirect near call instructions retired.
C4H 08H BR_INST_RETIRED.NEAR_RETURN Counts return instructions retired.
C4H 10H BR_INST_RETIRED.COND_NTAKEN Counts not-taken conditional branch instructions retired.
C4H 11H BR_INST_RETIRED.COND Counts conditional branch instructions retired.
C4H 20H BR_INST_RETIRED.NEAR_TAKEN Counts taken branch instructions retired.
C4H 40H BR_INST_RETIRED.FAR_BRANCH Counts far branch instructions retired.
C4H 80H BR_INST_RETIRED.INDIRECT Counts all indirect branch instructions retired (excluding RETs; TSX aborts are considered indirect branches).
C5H 00H BR_MISP_RETIRED.ALL_BRANCHES Counts all the retired branch instructions that were mispredicted by the processor. A branch misprediction occurs when the processor incorrectly predicts the destination of the branch. When the misprediction is discovered at execution, all the instructions executed in the wrong (speculative) path must be discarded, and the processor must start fetching from the correct path.
C5H 01H BR_MISP_RETIRED.COND_TAKEN Counts taken conditional mispredicted branch instructions retired.
C5H 11H BR_MISP_RETIRED.COND Counts mispredicted conditional branch instructions retired.
C5H 20H BR_MISP_RETIRED.NEAR_TAKEN Counts the number of near branch instructions retired that were mispredicted and taken.
C5H 80H BR_MISP_RETIRED.INDIRECT Counts all mispredicted indirect branch instructions retired (excluding RETs; TSX aborts are considered indirect branches).
C6H 01H FRONTEND_RETIRED.DSB_MISS Counts retired instructions that experienced a DSB (Decode Stream Buffer, i.e., the decoded instruction cache) miss.
C6H 01H FRONTEND_RETIRED.L1I_MISS Counts retired instructions that experienced an Instruction L1 Cache true miss.
C6H 01H FRONTEND_RETIRED.L2_MISS Counts retired instructions that experienced an Instruction L2 Cache true miss.
C6H 01H FRONTEND_RETIRED.ITLB_MISS Counts retired instructions that experienced an ITLB (Instruction TLB) true miss.
C6H 01H FRONTEND_RETIRED.STLB_MISS Counts retired instructions that experienced an STLB (2nd-level TLB) true miss.
C6H 01H FRONTEND_RETIRED.LATENCY_GE_2 Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 2 cycles which was not interrupted by a back-end stall.
C6H 01H FRONTEND_RETIRED.LATENCY_GE_4 Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 4 cycles which was not interrupted by a back-end stall.
C6H 01H FRONTEND_RETIRED.LATENCY_GE_8 Counts retired instructions that are delivered to the back-end after a front-end stall of at least 8 cycles. During this period the front-end delivered no uops.
C6H 01H FRONTEND_RETIRED.LATENCY_GE_16 Counts retired instructions that are delivered to the back-end after a front-end stall of at least 16 cycles. During this period the front-end delivered no uops.
C6H 01H FRONTEND_RETIRED.LATENCY_GE_32 Counts retired instructions that are delivered to the back-end after a front-end stall of at least 32 cycles. During this period the front-end delivered no uops.
C6H 01H FRONTEND_RETIRED.LATENCY_GE_64 Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 64 cycles which was not interrupted by a back-end stall.
C6H 01H FRONTEND_RETIRED.LATENCY_GE_128 Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 128 cycles which was not interrupted by a back-end stall.
C6H 01H FRONTEND_RETIRED.LATENCY_GE_256 Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 256 cycles which was not interrupted by a back-end stall.
C6H 01H FRONTEND_RETIRED.LATENCY_GE_512 Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 512 cycles which was not interrupted by a back-end stall.
C6H 01H FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1 Counts retired instructions that are delivered to the back-end after the front-end had at least 1 bubble-slot for a period of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there was no RAT stall.
C7H 01H FP_ARITH_INST_RETIRED.SCALAR_DOUBLE Counts number of SSE/AVX computational scalar double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.
C7H 02H FP_ARITH_INST_RETIRED.SCALAR_SINGLE Counts number of SSE/AVX computational scalar single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.
C7H 04H FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE Counts number of SSE/AVX computational 128-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.
C7H 08H FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE Counts number of SSE/AVX computational 128-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 4 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.
C7H 10H FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE Counts number of SSE/AVX computational 256-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 4 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.
C7H 20H FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE Counts number of SSE/AVX computational 256-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.
C7H 40H FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE Counts number of SSE/AVX computational 512-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT14 RCP14 RANGE FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.
C7H 80H FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE Counts number of SSE/AVX computational 512-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 16 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT14 RCP14 RANGE FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.
C8H 01H HLE_RETIRED.START Counts the number of times we entered an HLE region.
Does not count nested transactions.
C8H 02H HLE_RETIRED.COMMIT Counts the number of times HLE commit succeeded.
C8H 04H HLE_RETIRED.ABORTED Counts the number of times HLE abort was triggered.
C8H 08H HLE_RETIRED.ABORTED_MEM Counts the number of times an HLE execution aborted
due to various memory events (e.g., read/write capacity and conflicts).
C8H 20H HLE_RETIRED.ABORTED_UNFRIENDLY Counts the number of times an HLE execution aborted
due to HLE-unfriendly instructions and certain unfriendly events (such as AD assists etc.).
C8H 80H HLE_RETIRED.ABORTED_EVENTS Counts the number of times an HLE execution aborted
due to unfriendly events (such as interrupts).
C9H 01H RTM_RETIRED.START Counts the number of times we entered an RTM region.
Does not count nested transactions.
C9H 02H RTM_RETIRED.COMMIT Counts the number of times RTM commit succeeded.
C9H 04H RTM_RETIRED.ABORTED Counts the number of times RTM abort was triggered.
C9H 08H RTM_RETIRED.ABORTED_MEM Counts the number of times an RTM execution aborted
due to various memory events (e.g. read/write capacity
and conflicts).
C9H 20H RTM_RETIRED.ABORTED_UNFRIENDLY Counts the number of times an RTM execution aborted
due to HLE-unfriendly instructions.
C9H 40H RTM_RETIRED.ABORTED_MEMTYPE Counts the number of times an RTM execution aborted
due to incompatible memory type.
C9H 80H RTM_RETIRED.ABORTED_EVENTS Counts the number of times an RTM execution aborted
due to none of the previous 4 categories (e.g., interrupt).
CCH 20H MISC_RETIRED.LBR_INSERTS Increments when an entry is added to the Last Branch
Record (LBR) array (or removed from the array in the case of RETURNs in call-stack mode). The event
requires LBRs to be enabled via the IA32_DEBUGCTL MSR and branch type selection via MSR_LBR_SELECT.
CCH 40H MISC_RETIRED.PAUSE_INST Counts number of retired PAUSE instructions (that do
not end up with a VM exit to the VMM; TSX-aborted instructions may be counted).
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 Counts randomly selected loads when the latency from
first dispatch to completion is greater than 4 cycles. Reported latency may be longer than just the memory latency.
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8 Counts randomly selected loads when the latency from
first dispatch to completion is greater than 8 cycles. Reported latency may be longer than just the memory latency.
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16 Counts randomly selected loads when the latency from
first dispatch to completion is greater than 16 cycles. Reported latency may be longer than just the memory latency.
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32 Counts randomly selected loads when the latency from
first dispatch to completion is greater than 32 cycles. Reported latency may be longer than just the memory latency.
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64 Counts randomly selected loads when the latency from
first dispatch to completion is greater than 64 cycles. Reported latency may be longer than just the memory latency.
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128 Counts randomly selected loads when the latency from
first dispatch to completion is greater than 128 cycles. Reported latency may be longer than just the memory latency.
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256 Counts randomly selected loads when the latency from
first dispatch to completion is greater than 256 cycles. Reported latency may be longer than just the memory latency.
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512 Counts randomly selected loads when the latency from
first dispatch to completion is greater than 512 cycles. Reported latency may be longer than just the memory latency.
D0H 11H MEM_INST_RETIRED.STLB_MISS_LOADS Counts retired load instructions that miss the STLB.
D0H 12H MEM_INST_RETIRED.STLB_MISS_STORES Counts retired store instructions that miss the STLB.
D0H 21H MEM_INST_RETIRED.LOCK_LOADS Counts retired load instructions with locked access.
D0H 41H MEM_INST_RETIRED.SPLIT_LOADS Counts retired load instructions that split across a cacheline boundary.
D0H 42H MEM_INST_RETIRED.SPLIT_STORES Counts retired store instructions that split across a cacheline boundary.
D0H 81H MEM_INST_RETIRED.ALL_LOADS Counts all retired load instructions. This event accounts
for SW prefetch instructions for loads.
D0H 82H MEM_INST_RETIRED.ALL_STORES Counts all retired store instructions. This event accounts
for SW prefetch instructions and the PREFETCHW instruction for stores.
D1H 01H MEM_LOAD_RETIRED.L1_HIT Counts retired load instructions with at least one uop
that hit in the L1 data cache. This event includes all SW
prefetches and lock instructions regardless of the data
source.
D1H 02H MEM_LOAD_RETIRED.L2_HIT Counts retired load instructions with L2 cache hits as
data sources.
D1H 04H MEM_LOAD_RETIRED.L3_HIT Counts retired load instructions with at least one uop
that hit in the L3 cache.
D1H 08H MEM_LOAD_RETIRED.L1_MISS Counts retired load instructions with at least one uop
that missed in the L1 cache.
D1H 10H MEM_LOAD_RETIRED.L2_MISS Counts retired load instructions that missed the L2 cache as
data sources.
D1H 20H MEM_LOAD_RETIRED.L3_MISS Counts retired load instructions with at least one uop
that missed in the L3 cache.
D1H 40H MEM_LOAD_RETIRED.FB_HIT Counts retired load instructions with at least one uop
that missed in the L1 data cache but hit the FB (Fill Buffer), due to a
preceding miss to the same cache line with the data not ready.
D2H 01H MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS Counts the retired load instructions whose data sources
were L3 hit and cross-core snoop missed in on-pkg core cache.
D2H 02H MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT Counts retired load instructions whose data sources
were L3 and cross-core snoop hits in on-pkg core cache.
D2H 04H MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM Counts retired load instructions whose data sources
were HitM responses from shared L3.
D2H 08H MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE Counts retired load instructions whose data sources
were hits in L3 without snoops required.
E6H 01H BACLEARS.ANY Counts the number of times the front-end is resteered
when it finds a branch instruction in a fetch line. This
occurs for the first time a branch instruction is fetched
or when the branch is not tracked by the BPU (Branch
Prediction Unit) anymore.
ECH 02H CPU_CLK_UNHALTED.DISTRIBUTED This event distributes cycle counts between active
hyperthreads, i.e., those in C0. A hyperthread becomes inactive when it executes the HLT or MWAIT
instructions. If all other hyperthreads are inactive (or disabled or do not exist), all counts are attributed to
this hyperthread. To obtain the full count when the core is active, sum the counts from each hyperthread.
F1H 1FH L2_LINES_IN.ALL Counts the number of L2 cache lines filling the L2.
Counting does not cover rejects.
F4H 04H SQ_MISC.SQ_FULL Counts the cycles for which the thread is active and the
superQ cannot take any more entries.
CMSK1: CounterMask = 1 required; CMSK4: CounterMask = 4 required; CMSK6: CounterMask = 6 required; CMSK8: CounterMask = 8
required; CMSK10: CounterMask = 10 required; CMSK12: CounterMask = 12 required; CMSK16: CounterMask = 16 required; CMSK20:
CounterMask = 20 required.
AnyT: AnyThread = 1 required.
INV: Invert = 1 required.
EDG: EDGE = 1 required.
PSDLA: Also supports PEBS and DataLA.
PS: Also supports PEBS.
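The FP_ARITH_INST_RETIRED.* events above weight differently toward a total floating-point operation count: each count of a packed event represents as many operations as there are elements (2, 4, 8, or 16), and FMA-class instructions already increment the counters twice, so no further FMA correction is needed. A minimal sketch of deriving a FLOP total from raw counter values follows; the mechanism for reading the counters (e.g., RDPMC or a profiling tool) is assumed, and the scalar and 128-bit weights, which follow the same per-element rule, are included for completeness.

```python
# Derive a total floating-point operation count from FP_ARITH_INST_RETIRED
# sub-event counts. Each counter increment represents a fixed number of
# computation operations; FMA-type instructions increment the counters
# twice in hardware, so no extra correction is applied here.
OPS_PER_COUNT = {
    "SCALAR_DOUBLE": 1,
    "SCALAR_SINGLE": 1,
    "128B_PACKED_DOUBLE": 2,
    "128B_PACKED_SINGLE": 4,
    "256B_PACKED_DOUBLE": 4,
    "256B_PACKED_SINGLE": 8,
    "512B_PACKED_DOUBLE": 8,
    "512B_PACKED_SINGLE": 16,
}

def total_flops(counts):
    """counts: mapping from FP_ARITH_INST_RETIRED sub-event name to raw value."""
    return sum(OPS_PER_COUNT[name] * value for name, value in counts.items())

# Example: 1000 256-bit packed-double counts and 500 512-bit packed-single counts.
print(total_flops({"256B_PACKED_DOUBLE": 1000, "512B_PACKED_SINGLE": 500}))  # → 12000
```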
Table 19-12 lists performance events supporting Intel TSX (see Section 18.3.6.5); these events are applicable to
processors based on Skylake microarchitecture. Where Skylake microarchitecture implements TSX-related event
semantics that differ from Table 19-12, the differences are listed in Table 19-7.
Table 19-7. Intel® TSX Performance Event Addendum in Processors based on Skylake Microarchitecture
Event Umask
Num. Value Event Mask Mnemonic Description Comment
54H 02H TX_MEM.ABORT_CAPACITY Number of times a transactional abort was signaled due
to a data capacity limitation for transactional reads or
writes.
apply to processors with CPUID signature of DisplayFamily_DisplayModel encoding with the following values:
06_57H and 06_85H.
19.6 PERFORMANCE MONITORING EVENTS FOR THE INTEL® CORE™ M AND 5TH
GENERATION INTEL® CORE™ PROCESSORS
The Intel® Core™ M processors, the 5th generation Intel® Core™ processors, and the Intel® Xeon® processor E3-1200
v4 product family are based on the Broadwell microarchitecture. They support the architectural performance
monitoring events listed in Table 19-1. Model-specific performance monitoring events in the processor core are listed in
Table 19-9. The events in Table 19-9 apply to processors with a CPUID signature of DisplayFamily_DisplayModel
encoding 06_3DH or 06_47H. Table 19-12 lists performance events supporting Intel
TSX (see Section 18.3.6.5); these events are available on processors based on Broadwell microarchitecture. Fixed
counters in the core PMU support the architectural events defined in Table 19-2.
Model-specific performance monitoring events located in the uncore sub-system are implementation-specific
across platforms using processors based on Broadwell microarchitecture with different
DisplayFamily_DisplayModel signatures. Processors with a CPUID signature of DisplayFamily_DisplayModel 06_3DH
or 06_47H support the uncore performance events listed in Table 19-13.
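The DisplayFamily_DisplayModel signatures used throughout this section are computed from CPUID leaf 01H EAX: DisplayFamily adds the extended-family field to the base family only when the base family is 0FH, and DisplayModel prepends the extended-model field when the base family is 06H or 0FH. A sketch of that decoding follows; obtaining the raw EAX value is assumed, and 000306D4H below is a Broadwell signature used for illustration.

```python
def display_family_model(eax):
    """Decode CPUID.01H:EAX into (DisplayFamily, DisplayModel)."""
    base_family = (eax >> 8) & 0xF
    base_model = (eax >> 4) & 0xF
    # Extended family (bits 27:20) is added only when base family is 0FH.
    family = base_family + ((eax >> 20) & 0xFF) if base_family == 0xF else base_family
    model = base_model
    # Extended model (bits 19:16) is prepended when base family is 06H or 0FH.
    if base_family in (0x6, 0xF):
        model |= ((eax >> 16) & 0xF) << 4
    return family, model

# EAX = 000306D4H decodes to DisplayFamily_DisplayModel 06_3DH (Broadwell).
fam, mod = display_family_model(0x000306D4)
print(f"{fam:02X}_{mod:02X}H")  # → 06_3DH
```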
Table 19-9. Performance Events of the Processor Core Supported by Broadwell Microarchitecture
Event Umask
Num. Value Event Mask Mnemonic Description Comment
03H 02H LD_BLOCKS.STORE_FORWARD Loads blocked by overlapping with store buffer that
cannot be forwarded.
03H 08H LD_BLOCKS.NO_SR The number of times that split load operations are
temporarily blocked because all resources for
handling the split accesses are in use.
05H 01H MISALIGN_MEM_REF.LOADS Speculative cache-line split load uops dispatched to
L1D.
05H 02H MISALIGN_MEM_REF.STORES Speculative cache-line split store-address uops
dispatched to L1D.
07H 01H LD_BLOCKS_PARTIAL.ADDRESS_ALIAS False dependencies in MOB due to partial compare
on address.
08H 01H DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK Load misses in all TLB levels that cause a page walk
of any page size.
08H 02H DTLB_LOAD_MISSES.WALK_COMPLETED_4K Completed page walks due to demand load misses
that caused 4K page walks in any TLB levels.
08H 10H DTLB_LOAD_MISSES.WALK_DURATION Cycles PMH is busy with a walk.
08H 20H DTLB_LOAD_MISSES.STLB_HIT_4K Load misses that missed DTLB but hit STLB (4K).
0DH 03H INT_MISC.RECOVERY_CYCLES Cycles waiting to recover after Machine Clears except
JEClear. Set Cmask = 1. Set Edge to count occurrences.
0EH 01H UOPS_ISSUED.ANY Increments each cycle with the # of uops issued by the RAT to the RS.
Set Cmask = 1, Inv = 1, Any = 1 to count stalled cycles of this core. Set Cmask = 1, Inv = 1 to
count stalled cycles.
0EH 10H UOPS_ISSUED.FLAGS_MERGE Number of flags-merge uops allocated. Such uops
add delay.
0EH 20H UOPS_ISSUED.SLOW_LEA Number of slow LEA or similar uops allocated. Such
uop has 3 sources (for example, 2 sources +
immediate) regardless of whether it is a result of
LEA instruction or not.
0EH 40H UOPS_ISSUED.SINGLE_MUL Number of multiply packed/scalar single precision
uops allocated.
14H 01H ARITH.FPU_DIV_ACTIVE Cycles when divider is busy executing divide
operations.
24H 21H L2_RQSTS.DEMAND_DATA_RD_MISS Demand data read requests that missed L2, no rejects.
24H 41H L2_RQSTS.DEMAND_DATA_RD_HIT Demand data read requests that hit L2 cache.
24H 50H L2_RQSTS.L2_PF_HIT Counts all L2 HW prefetcher requests that hit L2.
24H 30H L2_RQSTS.L2_PF_MISS Counts all L2 HW prefetcher requests that missed L2.
24H E1H L2_RQSTS.ALL_DEMAND_DATA_RD Counts any demand and L1 HW prefetch data load
requests to L2.
24H E2H L2_RQSTS.ALL_RFO Counts all L2 store RFO requests.
24H E4H L2_RQSTS.ALL_CODE_RD Counts all L2 code requests.
24H F8H L2_RQSTS.ALL_PF Counts all L2 HW prefetcher requests.
27H 50H L2_DEMAND_RQSTS.WB_HIT Not rejected writebacks that hit L2 cache.
2EH 4FH LONGEST_LAT_CACHE.REFERENCE This event counts requests originating from the core
that reference a cache line in the last level cache. See Table 19-1.
2EH 41H LONGEST_LAT_CACHE.MISS This event counts each cache miss condition for
references to the last level cache. See Table 19-1.
3CH 00H CPU_CLK_UNHALTED.THREAD_P Counts the number of thread cycles while the thread
is not in a halt state. The thread enters the halt state when it is running the HLT instruction. The core
frequency may change from time to time due to power or thermal throttling. See Table 19-1.
3CH 01H CPU_CLK_THREAD_UNHALTED.REF_XCLK Increments at the frequency of XCLK (100 MHz)
when not halted. See Table 19-1.
48H 01H L1D_PEND_MISS.PENDING Increments the number of outstanding L1D misses
every cycle. Set Cmask = 1 and Edge = 1 to count occurrences. Counter 2 only;
set Cmask = 1 to count cycles.
49H 01H DTLB_STORE_MISSES.MISS_CAUSES_A_WALK Miss in all TLB levels causes a page walk of any page
size (4K/2M/4M/1G).
49H 02H DTLB_STORE_MISSES.WALK_COMPLETED_4K Completed page walks due to store misses in one or
more TLB levels of 4K page structure.
49H 10H DTLB_STORE_MISSES.WALK_DURATION Cycles PMH is busy with this walk.
49H 20H DTLB_STORE_MISSES.STLB_HIT_4K Store misses that missed DTLB but hit STLB (4K).
4CH 02H LOAD_HIT_PRE.HW_PF Non-SW-prefetch load dispatches that hit fill buffer
allocated for H/W prefetch.
4FH 10H EPT.WALK_CYCLES Cycles of Extended Page Table walks.
51H 01H L1D.REPLACEMENT Counts the number of lines brought into the L1 data
cache.
58H 04H MOVE_ELIMINATION.INT_NOT_ELIMINATED Number of integer move elimination candidate uops
that were not eliminated.
58H 08H MOVE_ELIMINATION.SIMD_NOT_ELIMINATED Number of SIMD move elimination candidate uops
that were not eliminated.
58H 01H MOVE_ELIMINATION.INT_ELIMINATED Number of integer move elimination candidate uops
that were eliminated.
58H 02H MOVE_ELIMINATION.SIMD_ELIMINATED Number of SIMD move elimination candidate uops
that were eliminated.
5CH 01H CPL_CYCLES.RING0 Unhalted core cycles when the thread is in ring 0. Use Edge to count
transition.
5CH 02H CPL_CYCLES.RING123 Unhalted core cycles when the thread is not in ring 0.
5EH 01H RS_EVENTS.EMPTY_CYCLES Cycles the RS is empty for the thread.
60H 01H OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD Offcore outstanding demand data read transactions
in SQ to uncore. Set Cmask = 1 to count cycles. Use only when HTT is off.
60H 02H OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD Offcore outstanding demand code read transactions
in SQ to uncore. Set Cmask = 1 to count cycles. Use only when HTT is off.
60H 04H OFFCORE_REQUESTS_OUTSTANDING.DEMAND_RFO Offcore outstanding RFO store transactions in SQ to
uncore. Set Cmask = 1 to count cycles. Use only when HTT is off.
60H 08H OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD Offcore outstanding cacheable data read
transactions in SQ to uncore. Set Cmask = 1 to count cycles. Use only when HTT is off.
63H 01H LOCK_CYCLES.SPLIT_LOCK_UC_LOCK_DURATION Cycles in which the L1D and L2 are locked, due to a
UC lock or split lock.
63H 02H LOCK_CYCLES.CACHE_LOCK_DURATION Cycles in which the L1D is locked.
79H 02H IDQ.EMPTY Counts cycles the IDQ is empty.
79H 04H IDQ.MITE_UOPS Increments each cycle with the # of uops delivered to the IDQ from the
MITE path. Set Cmask = 1 to count cycles. Can combine Umask 04H and 20H.
79H 08H IDQ.DSB_UOPS Increments each cycle with the # of uops delivered to the IDQ from the
DSB path. Set Cmask = 1 to count cycles. Can combine Umask 08H and 10H.
79H 10H IDQ.MS_DSB_UOPS Increments each cycle with the # of uops delivered to the IDQ
by the DSB when the MS is busy. Set Cmask = 1 to count cycles; add Edge = 1 to count # of deliveries.
Can combine Umask 04H, 08H.
79H 20H IDQ.MS_MITE_UOPS Increments each cycle with the # of uops delivered to the IDQ
by MITE when the MS is busy. Set Cmask = 1 to count cycles. Can combine Umask 04H, 08H.
79H 30H IDQ.MS_UOPS Increments each cycle with the # of uops delivered to the IDQ from the
MS by either the DSB or MITE. Set Cmask = 1 to count cycles. Can combine Umask 04H, 08H.
79H 18H IDQ.ALL_DSB_CYCLES_ANY_UOPS Counts cycles in which the DSB delivers at least one uop. Set
Cmask = 1.
79H 18H IDQ.ALL_DSB_CYCLES_4_UOPS Counts cycles in which the DSB delivers four uops. Set Cmask = 4.
79H 24H IDQ.ALL_MITE_CYCLES_ANY_UOPS Counts cycles in which MITE delivers at least one uop. Set
Cmask = 1.
79H 24H IDQ.ALL_MITE_CYCLES_4_UOPS Counts cycles in which MITE delivers four uops. Set Cmask = 4.
79H 3CH IDQ.MITE_ALL_UOPS Number of uops delivered to the IDQ from any path.
80H 02H ICACHE.MISSES Number of Instruction Cache, Streaming Buffer and
Victim Cache Misses. Includes UC accesses.
85H 01H ITLB_MISSES.MISS_CAUSES_A_WALK Misses in ITLB that cause a page walk of any page size.
85H 02H ITLB_MISSES.WALK_COMPLETED_4K Completed page walks due to misses in ITLB 4K page
entries.
85H 10H ITLB_MISSES.WALK_DURATION Cycles PMH is busy with a walk.
85H 20H ITLB_MISSES.STLB_HIT_4K ITLB misses that hit STLB (4K).
87H 01H ILD_STALL.LCP Stalls caused by changing prefix length of the
instruction.
88H 01H BR_INST_EXEC.COND Qualify conditional near branch instructions executed, but not
necessarily retired. Must combine with umask 40H, 80H.
88H 02H BR_INST_EXEC.DIRECT_JMP Qualify all unconditional near branch instructions
excluding calls and indirect branches. Must combine with umask 80H.
88H 04H BR_INST_EXEC.INDIRECT_JMP_NON_CALL_RET Qualify executed indirect near branch instructions
that are not calls or returns. Must combine with umask 80H.
88H 08H BR_INST_EXEC.RETURN_NEAR Qualify indirect near branches that have a return
mnemonic. Must combine with umask 80H.
88H 10H BR_INST_EXEC.DIRECT_NEAR_CALL Qualify unconditional near call branch instructions,
excluding non-call branches, executed. Must combine with umask 80H.
88H 20H BR_INST_EXEC.INDIRECT_NEAR_CALL Qualify indirect near calls, including both register and
memory indirect, executed. Must combine with umask 80H.
88H 40H BR_INST_EXEC.NONTAKEN Qualify non-taken near branches executed. Applicable to umask 01H only.
88H 80H BR_INST_EXEC.TAKEN Qualify taken near branches executed. Must combine
with 01H, 02H, 04H, 08H, 10H, 20H.
88H FFH BR_INST_EXEC.ALL_BRANCHES Counts all near executed branches (not necessarily
retired).
89H 01H BR_MISP_EXEC.COND Qualify conditional near branch instructions mispredicted. Must combine
with umask 40H, 80H.
89H 04H BR_MISP_EXEC.INDIRECT_JMP_NON_CALL_RET Qualify mispredicted indirect near branch
instructions that are not calls or returns. Must combine with umask 80H.
89H 08H BR_MISP_EXEC.RETURN_NEAR Qualify mispredicted indirect near branches that have a
return mnemonic. Must combine with umask 80H.
89H 10H BR_MISP_EXEC.DIRECT_NEAR_CALL Qualify mispredicted unconditional near call branch
instructions, excluding non-call branches, executed. Must combine with umask 80H.
89H 20H BR_MISP_EXEC.INDIRECT_NEAR_CALL Qualify mispredicted indirect near calls, including
both register and memory indirect, executed. Must combine with umask 80H.
89H 40H BR_MISP_EXEC.NONTAKEN Qualify mispredicted non-taken near branches executed.
Applicable to umask 01H only.
89H 80H BR_MISP_EXEC.TAKEN Qualify mispredicted taken near branches executed.
Must combine with 01H, 02H, 04H, 08H, 10H, 20H.
89H FFH BR_MISP_EXEC.ALL_BRANCHES Counts all near executed branches (not necessarily retired).
9CH 01H IDQ_UOPS_NOT_DELIVERED.CORE Count issue pipeline slots where no uop was
delivered from the front end to the back end when there is no back end stall. Use Cmask to
qualify uop b/w.
A1H 01H UOPS_DISPATCHED_PORT.PORT_0 Counts the number of cycles in which a uop is
dispatched to port 0. Set AnyThread to count per core.
A1H 02H UOPS_DISPATCHED_PORT.PORT_1 Counts the number of cycles in which a uop is
dispatched to port 1. Set AnyThread to count per core.
A1H 04H UOPS_DISPATCHED_PORT.PORT_2 Counts the number of cycles in which a uop is
dispatched to port 2. Set AnyThread to count per core.
A1H 08H UOPS_DISPATCHED_PORT.PORT_3 Counts the number of cycles in which a uop is
dispatched to port 3. Set AnyThread to count per core.
A1H 10H UOPS_DISPATCHED_PORT.PORT_4 Counts the number of cycles in which a uop is
dispatched to port 4. Set AnyThread to count per core.
A1H 20H UOPS_DISPATCHED_PORT.PORT_5 Counts the number of cycles in which a uop is
dispatched to port 5. Set AnyThread to count per core.
A1H 40H UOPS_DISPATCHED_PORT.PORT_6 Counts the number of cycles in which a uop is
dispatched to port 6. Set AnyThread to count per core.
A1H 80H UOPS_DISPATCHED_PORT.PORT_7 Counts the number of cycles in which a uop is
dispatched to port 7. Set AnyThread to count per core.
A2H 01H RESOURCE_STALLS.ANY Cycles Allocation is stalled due to resource related
reason.
A2H 04H RESOURCE_STALLS.RS Cycles stalled due to no eligible RS entry available.
A2H 08H RESOURCE_STALLS.SB Cycles stalled due to no store buffers available (not
including draining from sync).
A2H 10H RESOURCE_STALLS.ROB Cycles stalled due to re-order buffer full.
A8H 01H LSD.UOPS Number of uops delivered by the LSD.
ABH 02H DSB2MITE_SWITCHES.PENALTY_CYCLES Cycles of delay due to Decode Stream Buffer to MITE
switches.
AEH 01H ITLB.ITLB_FLUSH Counts the number of ITLB flushes; includes
4k/2M/4M pages.
B0H 01H OFFCORE_REQUESTS.DEMAND_DATA_RD Demand data read requests sent to uncore. Use only
when HTT is off.
B0H 02H OFFCORE_REQUESTS.DEMAND_CODE_RD Demand code read requests sent to uncore. Use only
when HTT is off.
B0H 04H OFFCORE_REQUESTS.DEMAND_RFO Demand RFO read requests sent to uncore, including
regular RFOs, locks, ItoM. Use only when HTT is off.
B0H 08H OFFCORE_REQUESTS.ALL_DATA_RD Data read requests sent to uncore (demand and
prefetch). Use only when HTT is off.
B1H 01H UOPS_EXECUTED.THREAD Counts total number of uops to be executed per
logical processor each cycle. Use Cmask to count stall cycles.
B1H 02H UOPS_EXECUTED.CORE Counts total number of uops to be executed per core
each cycle. Do not need to set ANY.
B7H 01H OFF_CORE_RESPONSE_0 See Section 18.3.4.5, “Off-core Response
Performance Monitoring”. Requires MSR 01A6H.
BBH 01H OFF_CORE_RESPONSE_1 See Section 18.3.4.5, “Off-core Response
Performance Monitoring”. Requires MSR 01A7H.
BCH 11H PAGE_WALKER_LOADS.DTLB_L1 Number of DTLB page walker loads that hit in the
L1+FB.
BCH 21H PAGE_WALKER_LOADS.ITLB_L1 Number of ITLB page walker loads that hit in the
L1+FB.
BCH 12H PAGE_WALKER_LOADS.DTLB_L2 Number of DTLB page walker loads that hit in the L2.
BCH 22H PAGE_WALKER_LOADS.ITLB_L2 Number of ITLB page walker loads that hit in the L2.
BCH 14H PAGE_WALKER_LOADS.DTLB_L3 Number of DTLB page walker loads that hit in the L3.
BCH 24H PAGE_WALKER_LOADS.ITLB_L3 Number of ITLB page walker loads that hit in the L3.
BCH 18H PAGE_WALKER_LOADS.DTLB_MEMORY Number of DTLB page walker loads from memory.
C0H 00H INST_RETIRED.ANY_P Number of instructions at retirement. See Table 19-1.
C0H 01H INST_RETIRED.PREC_DIST Precise instruction retired event with HW to reduce PMC1 only.
effect of PEBS shadow in IP distribution.
C0H 02H INST_RETIRED.X87 FP operations retired. X87 FP operations that have
no exceptions.
C1H 08H OTHER_ASSISTS.AVX_TO_SSE Number of transitions from AVX-256 to legacy SSE
when penalty applicable.
C1H 10H OTHER_ASSISTS.SSE_TO_AVX Number of transitions from SSE to AVX-256 when
penalty applicable.
C1H 40H OTHER_ASSISTS.ANY_WB_ASSIST Number of microcode assists invoked by HW upon
uop writeback.
C2H 01H UOPS_RETIRED.ALL Counts the number of micro-ops retired. Use Cmask = 1 and invert
to count active cycles or stalled cycles. Supports PEBS and DataLA; use Any = 1 for core granularity.
C2H 02H UOPS_RETIRED.RETIRE_SLOTS Counts the number of retirement slots used each
cycle. Supports PEBS.
C3H 01H MACHINE_CLEARS.CYCLES Counts cycles while a machine clears stalled forward
progress of a logical processor or a processor core.
C3H 02H MACHINE_CLEARS.MEMORY_ORDERING Counts the number of machine clears due to memory
order conflicts.
C3H 04H MACHINE_CLEARS.SMC Number of self-modifying-code machine clears
detected.
C3H 20H MACHINE_CLEARS.MASKMOV Counts the number of executed AVX masked load
operations that refer to an illegal address range with
the mask bits set to 0.
C4H 00H BR_INST_RETIRED.ALL_BRANCHES Branch instructions at retirement. See Table 19-1.
C4H 01H BR_INST_RETIRED.CONDITIONAL Counts the number of conditional branch instructions
retired. Supports PEBS.
C4H 02H BR_INST_RETIRED.NEAR_CALL Direct and indirect near call instructions retired. Supports PEBS.
C4H 04H BR_INST_RETIRED.ALL_BRANCHES Counts the number of branch instructions retired. Supports PEBS.
C4H 08H BR_INST_RETIRED.NEAR_RETURN Counts the number of near return instructions
retired. Supports PEBS.
C4H 10H BR_INST_RETIRED.NOT_TAKEN Counts the number of not taken branch instructions
retired.
C4H 20H BR_INST_RETIRED.NEAR_TAKEN Number of near taken branches retired. Supports PEBS.
C4H 40H BR_INST_RETIRED.FAR_BRANCH Number of far branches retired.
C5H 00H BR_MISP_RETIRED.ALL_BRANCHES Mispredicted branch instructions at retirement. See Table 19-1.
C5H 01H BR_MISP_RETIRED.CONDITIONAL Mispredicted conditional branch instructions retired. Supports PEBS.
C5H 04H BR_MISP_RETIRED.ALL_BRANCHES Mispredicted macro branch instructions retired. Supports PEBS.
CAH 02H FP_ASSIST.X87_OUTPUT Number of X87 FP assists due to output values.
CAH 04H FP_ASSIST.X87_INPUT Number of X87 FP assists due to input values.
CAH 08H FP_ASSIST.SIMD_OUTPUT Number of SIMD FP assists due to output values.
CAH 10H FP_ASSIST.SIMD_INPUT Number of SIMD FP assists due to input values.
CAH 1EH FP_ASSIST.ANY Cycles with any input/output SSE* or FP assists.
CCH 20H ROB_MISC_EVENTS.LBR_INSERTS Count cases of saving new LBR records by hardware.
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY Randomly sampled loads whose latency is above a
user-defined threshold. A small fraction of the overall loads are sampled due to randomization.
Specify threshold in MSR 3F6H.
D0H 11H MEM_UOPS_RETIRED.STLB_MISS_LOADS Retired load uops that miss the STLB. Supports PEBS
and DataLA.
D0H 12H MEM_UOPS_RETIRED.STLB_MISS_STORES Retired store uops that miss the STLB. Supports PEBS
and DataLA.
D0H 21H MEM_UOPS_RETIRED.LOCK_LOADS Retired load uops with locked access. Supports PEBS
and DataLA.
D0H 41H MEM_UOPS_RETIRED.SPLIT_LOADS Retired load uops that split across a cacheline
boundary. Supports PEBS and DataLA.
D0H 42H MEM_UOPS_RETIRED.SPLIT_STORES Retired store uops that split across a cacheline
boundary. Supports PEBS and DataLA.
D0H 81H MEM_UOPS_RETIRED.ALL_LOADS All retired load uops. Supports PEBS and DataLA.
D0H 82H MEM_UOPS_RETIRED.ALL_STORES All retired store uops. Supports PEBS and DataLA.
D1H 01H MEM_LOAD_UOPS_RETIRED.L1_HIT Retired load uops with L1 cache hits as data sources.
Supports PEBS and DataLA.
D1H 02H MEM_LOAD_UOPS_RETIRED.L2_HIT Retired load uops with L2 cache hits as data sources.
Supports PEBS and DataLA.
D1H 04H MEM_LOAD_UOPS_RETIRED.L3_HIT Retired load uops with L3 cache hits as data sources.
Supports PEBS and DataLA.
D1H 08H MEM_LOAD_UOPS_RETIRED.L1_MISS Retired load uops that missed the L1 cache as data
sources. Supports PEBS and DataLA.
D1H 10H MEM_LOAD_UOPS_RETIRED.L2_MISS Retired load uops that missed the L2. Unknown data
source excluded. Supports PEBS and DataLA.
D1H 20H MEM_LOAD_UOPS_RETIRED.L3_MISS Retired load uops that missed the L3. Excludes unknown
data source. Supports PEBS and DataLA.
D1H 40H MEM_LOAD_UOPS_RETIRED.HIT_LFB Retired load uops where data sources were load uops
that missed the L1 but hit the FB due to a preceding miss to the same cache line with data not ready.
Supports PEBS and DataLA.
D2H 01H MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS Retired load uops where data sources were L3 hit
and cross-core snoop missed in on-pkg core cache. Supports PEBS and DataLA.
D2H 02H MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT Retired load uops where data sources were L3 and
cross-core snoop hits in on-pkg core cache. Supports PEBS and DataLA.
D2H 04H MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM Retired load uops where data sources were HitM
responses from shared L3. Supports PEBS and DataLA.
D2H 08H MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_NONE Retired load uops where data sources were hits in
L3 without snoops required. Supports PEBS and DataLA.
D3H 01H MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM Retired load uops where data sources missed L3 but
were serviced from local DRAM. Supports PEBS and DataLA.
F0H 01H L2_TRANS.DEMAND_DATA_RD Demand data read requests that access L2 cache.
F0H 02H L2_TRANS.RFO RFO requests that access L2 cache.
F0H 04H L2_TRANS.CODE_RD L2 cache accesses when fetching instructions.
F0H 08H L2_TRANS.ALL_PF Any MLC or L3 HW prefetch accessing L2, including
rejects.
F0H 10H L2_TRANS.L1D_WB L1D writebacks that access L2 cache.
F0H 20H L2_TRANS.L2_FILL L2 fill requests that access L2 cache.
F0H 40H L2_TRANS.L2_WB L2 writebacks that access L2 cache.
F0H 80H L2_TRANS.ALL_REQUESTS Transactions accessing L2 pipe.
F1H 01H L2_LINES_IN.I L2 cache lines in I state filling L2. Counting does not cover
rejects.
F1H 02H L2_LINES_IN.S L2 cache lines in S state filling L2. Counting does not cover
rejects.
F1H 04H L2_LINES_IN.E L2 cache lines in E state filling L2. Counting does not cover
rejects.
F1H 07H L2_LINES_IN.ALL L2 cache lines filling L2. Counting does not cover
rejects.
F2H 05H L2_LINES_OUT.DEMAND_CLEAN Clean L2 cache lines evicted by demand.
Table 19-12 lists performance events supporting Intel TSX (see Section 18.3.6.5); these events are also applicable to
processors based on Broadwell microarchitecture. Where Broadwell microarchitecture implements TSX-related
event semantics that differ from Table 19-12, the differences are listed in Table 19-10.
Table 19-10. Intel® TSX Performance Event Addendum in Processors Based on Broadwell Microarchitecture
Event Umask
Num. Value Event Mask Mnemonic Description Comment
54H 02H TX_MEM.ABORT_CAPACITY Number of times a transactional abort was signaled due
to a data capacity limitation for transactional reads or
writes.
The performance events listed in Table 19-11 apply to processors with CPUID signature of DisplayFamily_DisplayModel encoding with the following
values: 06_3CH, 06_45H and 06_46H. Table 19-12 lists performance events focused on supporting Intel TSX (see
Section 18.3.6.5). Fixed counters in the core PMU support the architecture events defined in Table 19-2.
Additional information on event specifics (e.g., derivative events using specific IA32_PERFEVTSELx modifiers, limi-
tations, special notes and recommendations) can be found at https://software.intel.com/en-us/forums/software-
tuning-performance-optimization-platform-monitoring.
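The Cmask, Inv, Edge, and AnyThread modifiers referenced throughout the Comment columns below are fields of the IA32_PERFEVTSELx MSRs. As a rough illustration (not part of the manual; the helper name and defaults are our own), the following sketch packs an Event Num., Umask Value, and those modifiers into a programming value using the standard IA32_PERFEVTSELx bit layout:

```python
# Illustrative helper (an assumption for this sketch, not an Intel API):
# pack event/umask/modifier fields into an IA32_PERFEVTSELx value.
# Bit layout: 0-7 event select, 8-15 umask, 16 USR, 17 OS, 18 Edge,
# 21 AnyThread, 22 EN, 23 INV, 24-31 CMASK.

def perfevtsel(event: int, umask: int, *, usr: bool = True, os: bool = False,
               edge: bool = False, any_thread: bool = False, inv: bool = False,
               cmask: int = 0, enable: bool = True) -> int:
    value = (event & 0xFF) | ((umask & 0xFF) << 8)
    value |= (usr << 16) | (os << 17) | (edge << 18)
    value |= (any_thread << 21) | (enable << 22) | (inv << 23)
    value |= (cmask & 0xFF) << 24
    return value

# UOPS_ISSUED.ANY (Event 0EH, Umask 01H) with Cmask = 1, Inv = 1
# counts stalled cycles, per that event's Comment column.
stall_cycles_sel = perfevtsel(0x0E, 0x01, cmask=1, inv=True)
```

Writing the resulting value to an IA32_PERFEVTSELx MSR (and enabling the counter) is platform-specific and outside the scope of this sketch.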
Table 19-11. Performance Events in the Processor Core of 4th Generation Intel® Core™ Processors
Event Umask
Num. Value Event Mask Mnemonic Description Comment
03H 02H LD_BLOCKS.STORE_FORWARD Loads blocked by overlapping with store buffer that
cannot be forwarded.
03H 08H LD_BLOCKS.NO_SR The number of times that split load operations are
temporarily blocked because all resources for
handling the split accesses are in use.
05H 01H MISALIGN_MEM_REF.LOADS Speculative cache-line split load uops dispatched to
L1D.
05H 02H MISALIGN_MEM_REF.STORES Speculative cache-line split store-address uops
dispatched to L1D.
07H 01H LD_BLOCKS_PARTIAL.ADDRESS False dependencies in MOB due to partial compare
_ALIAS on address.
08H 01H DTLB_LOAD_MISSES.MISS_CAUS Misses in all TLB levels that cause a page walk of any
ES_A_WALK page size.
08H 02H DTLB_LOAD_MISSES.WALK_COM Completed page walks due to demand load misses
PLETED_4K that caused 4K page walks in any TLB levels.
08H 04H DTLB_LOAD_MISSES.WALK_COM Completed page walks due to demand load misses
PLETED_2M_4M that caused 2M/4M page walks in any TLB levels.
08H 0EH DTLB_LOAD_MISSES.WALK_COM Completed page walks in any TLB of any page size
PLETED due to demand load misses.
08H 10H DTLB_LOAD_MISSES.WALK_DUR Cycles PMH is busy with a walk.
ATION
08H 20H DTLB_LOAD_MISSES.STLB_HIT_ Load misses that missed DTLB but hit STLB (4K).
4K
08H 40H DTLB_LOAD_MISSES.STLB_HIT_ Load misses that missed DTLB but hit STLB (2M).
2M
08H 60H DTLB_LOAD_MISSES.STLB_HIT Number of cache load STLB hits. No page walk.
08H 80H DTLB_LOAD_MISSES.PDE_CACH DTLB demand load misses with low part of linear-to-
E_MISS physical address translation missed.
0DH 03H INT_MISC.RECOVERY_CYCLES Cycles waiting to recover after Machine Clears Set Edge to count
except JEClear. Set Cmask= 1. occurrences.
0EH 01H UOPS_ISSUED.ANY Increments each cycle the # of uops issued by the Set Cmask = 1, Inv = 1 to
RAT to RS. Set Cmask = 1, Inv = 1, Any = 1 to count count stalled cycles.
stalled cycles of this core.
0EH 10H UOPS_ISSUED.FLAGS_MERGE Number of flags-merge uops allocated. Such uops
add delay.
0EH 20H UOPS_ISSUED.SLOW_LEA Number of slow LEA or similar uops allocated. Such
uop has 3 sources (for example, 2 sources +
immediate) regardless of whether it is a result of
LEA instruction or not.
0EH 40H UOPS_ISSUED.SINGLE_MUL Number of multiply packed/scalar single precision
uops allocated.
24H 21H L2_RQSTS.DEMAND_DATA_RD_ Demand data read requests that missed L2, no
MISS rejects.
24H 41H L2_RQSTS.DEMAND_DATA_RD_ Demand data read requests that hit L2 cache.
HIT
24H E1H L2_RQSTS.ALL_DEMAND_DATA Counts any demand and L1 HW prefetch data load
_RD requests to L2.
24H 42H L2_RQSTS.RFO_HIT Counts the number of store RFO requests that hit
the L2 cache.
24H 22H L2_RQSTS.RFO_MISS Counts the number of store RFO requests that miss
the L2 cache.
24H E2H L2_RQSTS.ALL_RFO Counts all L2 store RFO requests.
24H 44H L2_RQSTS.CODE_RD_HIT Number of instruction fetches that hit the L2 cache.
24H 24H L2_RQSTS.CODE_RD_MISS Number of instruction fetches that missed the L2
cache.
24H 27H L2_RQSTS.ALL_DEMAND_MISS Demand requests that miss L2 cache.
24H E7H L2_RQSTS.ALL_DEMAND_REFE Demand requests to L2 cache.
RENCES
24H E4H L2_RQSTS.ALL_CODE_RD Counts all L2 code requests.
24H 50H L2_RQSTS.L2_PF_HIT Counts all L2 HW prefetcher requests that hit L2.
24H 30H L2_RQSTS.L2_PF_MISS Counts all L2 HW prefetcher requests that missed
L2.
24H F8H L2_RQSTS.ALL_PF Counts all L2 HW prefetcher requests.
24H 3FH L2_RQSTS.MISS All requests that missed L2.
24H FFH L2_RQSTS.REFERENCES All requests to L2 cache.
27H 50H L2_DEMAND_RQSTS.WB_HIT Not rejected writebacks that hit L2 cache.
2EH 4FH LONGEST_LAT_CACHE.REFEREN This event counts requests originating from the core See Table 19-1.
CE that reference a cache line in the last level cache.
2EH 41H LONGEST_LAT_CACHE.MISS This event counts each cache miss condition for See Table 19-1.
references to the last level cache.
3CH 00H CPU_CLK_UNHALTED.THREAD_ Counts the number of thread cycles while the thread See Table 19-1.
P is not in a halt state. The thread enters the halt state
when it is running the HLT instruction. The core
frequency may change from time to time due to
power or thermal throttling.
3CH 01H CPU_CLK_THREAD_UNHALTED. Increments at the frequency of XCLK (100 MHz) See Table 19-1.
REF_XCLK when not halted.
48H 01H L1D_PEND_MISS.PENDING Increments the number of outstanding L1D misses Counter 2 only.
every cycle. Set Cmask = 1 and Edge =1 to count Set Cmask = 1 to count
occurrences. cycles.
49H 01H DTLB_STORE_MISSES.MISS_CAU Miss in all TLB levels causes a page walk of any page
SES_A_WALK size (4K/2M/4M/1G).
49H 02H DTLB_STORE_MISSES.WALK_CO Completed page walks due to store misses in one or
MPLETED_4K more TLB levels of 4K page structure.
49H 04H DTLB_STORE_MISSES.WALK_CO Completed page walks due to store misses in one or
MPLETED_2M_4M more TLB levels of 2M/4M page structure.
49H 0EH DTLB_STORE_MISSES.WALK_CO Completed page walks due to store miss in any TLB
MPLETED levels of any page size (4K/2M/4M/1G).
49H 10H DTLB_STORE_MISSES.WALK_DU Cycles PMH is busy with this walk.
RATION
49H 20H DTLB_STORE_MISSES.STLB_HIT Store misses that missed DTLB but hit STLB (4K).
_4K
49H 40H DTLB_STORE_MISSES.STLB_HIT Store misses that missed DTLB but hit STLB (2M).
_2M
49H 60H DTLB_STORE_MISSES.STLB_HIT Store operations that miss the first TLB level but hit
the second and do not cause page walks.
49H 80H DTLB_STORE_MISSES.PDE_CAC DTLB store misses with low part of linear-to-physical
HE_MISS address translation missed.
4CH 01H LOAD_HIT_PRE.SW_PF Non-SW-prefetch load dispatches that hit fill buffer
allocated for S/W prefetch.
4CH 02H LOAD_HIT_PRE.HW_PF Non-SW-prefetch load dispatches that hit fill buffer
allocated for H/W prefetch.
51H 01H L1D.REPLACEMENT Counts the number of lines brought into the L1 data
cache.
58H 04H MOVE_ELIMINATION.INT_NOT_E Number of integer move elimination candidate uops
LIMINATED that were not eliminated.
58H 08H MOVE_ELIMINATION.SIMD_NOT_ Number of SIMD move elimination candidate uops
ELIMINATED that were not eliminated.
58H 01H MOVE_ELIMINATION.INT_ELIMIN Number of integer move elimination candidate uops
ATED that were eliminated.
58H 02H MOVE_ELIMINATION.SIMD_ELIMI Number of SIMD move elimination candidate uops
NATED that were eliminated.
5CH 01H CPL_CYCLES.RING0 Unhalted core cycles when the thread is in ring 0. Use Edge to count
transition.
5CH 02H CPL_CYCLES.RING123 Unhalted core cycles when the thread is not in ring 0.
5EH 01H RS_EVENTS.EMPTY_CYCLES Cycles the RS is empty for the thread.
60H 01H OFFCORE_REQUESTS_OUTSTAN Offcore outstanding demand data read transactions Use only when HTT is off.
DING.DEMAND_DATA_RD in SQ to uncore. Set Cmask=1 to count cycles.
60H 02H OFFCORE_REQUESTS_OUTSTAN Offcore outstanding Demand code Read transactions Use only when HTT is off.
DING.DEMAND_CODE_RD in SQ to uncore. Set Cmask=1 to count cycles.
60H 04H OFFCORE_REQUESTS_OUTSTAN Offcore outstanding RFO store transactions in SQ to Use only when HTT is off.
DING.DEMAND_RFO uncore. Set Cmask=1 to count cycles.
60H 08H OFFCORE_REQUESTS_OUTSTAN Offcore outstanding cacheable data read Use only when HTT is off.
DING.ALL_DATA_RD transactions in SQ to uncore. Set Cmask=1 to count
cycles.
63H 01H LOCK_CYCLES.SPLIT_LOCK_UC_ Cycles in which the L1D and L2 are locked, due to a
LOCK_DURATION UC lock or split lock.
63H 02H LOCK_CYCLES.CACHE_LOCK_DU Cycles in which the L1D is locked.
RATION
79H 02H IDQ.EMPTY Counts cycles the IDQ is empty.
79H 04H IDQ.MITE_UOPS Increment each cycle # of uops delivered to IDQ from Can combine Umask 04H
MITE path. Set Cmask = 1 to count cycles. and 20H.
79H 08H IDQ.DSB_UOPS Increment each cycle # of uops delivered to IDQ Can combine Umask 08H
from DSB path. Set Cmask = 1 to count cycles. and 10H.
79H 10H IDQ.MS_DSB_UOPS Increment each cycle # of uops delivered to IDQ Can combine Umask 04H,
when MS_busy by DSB. Set Cmask = 1 to count 08H.
cycles. Add Edge=1 to count # of delivery.
79H 20H IDQ.MS_MITE_UOPS Increment each cycle # of uops delivered to IDQ Can combine Umask 04H,
when MS_busy by MITE. Set Cmask = 1 to count 08H.
cycles.
79H 30H IDQ.MS_UOPS Increment each cycle # of uops delivered to IDQ from Can combine Umask 04H,
MS by either DSB or MITE. Set Cmask = 1 to count 08H.
cycles.
79H 18H IDQ.ALL_DSB_CYCLES_ANY_UO Counts cycles DSB is delivered at least one uop. Set
PS Cmask = 1.
79H 18H IDQ.ALL_DSB_CYCLES_4_UOPS Counts cycles DSB is delivered four uops. Set Cmask
= 4.
79H 24H IDQ.ALL_MITE_CYCLES_ANY_UO Counts cycles MITE is delivered at least one uop. Set
PS Cmask = 1.
79H 24H IDQ.ALL_MITE_CYCLES_4_UOPS Counts cycles MITE is delivered four uops. Set Cmask
= 4.
79H 3CH IDQ.MITE_ALL_UOPS # of uops delivered to IDQ from any path.
80H 02H ICACHE.MISSES Number of Instruction Cache, Streaming Buffer and
Victim Cache Misses. Includes UC accesses.
85H 01H ITLB_MISSES.MISS_CAUSES_A_ Misses in ITLB that causes a page walk of any page
WALK size.
85H 02H ITLB_MISSES.WALK_COMPLETE Completed page walks due to misses in ITLB 4K page
D_4K entries.
85H 04H ITLB_MISSES.WALK_COMPLETE Completed page walks due to misses in ITLB 2M/4M
D_2M_4M page entries.
85H 0EH ITLB_MISSES.WALK_COMPLETE Completed page walks in ITLB of any page size.
D
85H 10H ITLB_MISSES.WALK_DURATION Cycles PMH is busy with a walk.
85H 20H ITLB_MISSES.STLB_HIT_4K ITLB misses that hit STLB (4K).
85H 40H ITLB_MISSES.STLB_HIT_2M ITLB misses that hit STLB (2M).
85H 60H ITLB_MISSES.STLB_HIT ITLB misses that hit STLB. No page walk.
87H 01H ILD_STALL.LCP Stalls caused by changing prefix length of the
instruction.
87H 04H ILD_STALL.IQ_FULL Stall cycles due to the IQ being full.
88H 01H BR_INST_EXEC.COND Qualify conditional near branch instructions Must combine with
executed, but not necessarily retired. umask 40H, 80H.
88H 02H BR_INST_EXEC.DIRECT_JMP Qualify all unconditional near branch instructions Must combine with
excluding calls and indirect branches. umask 80H.
88H 04H BR_INST_EXEC.INDIRECT_JMP_ Qualify executed indirect near branch instructions Must combine with
NON_CALL_RET that are not calls or returns. umask 80H.
88H 08H BR_INST_EXEC.RETURN_NEAR Qualify indirect near branches that have a return Must combine with
mnemonic. umask 80H.
88H 10H BR_INST_EXEC.DIRECT_NEAR_C Qualify unconditional near call branch instructions, Must combine with
ALL excluding non-call branch, executed. umask 80H.
88H 20H BR_INST_EXEC.INDIRECT_NEAR Qualify indirect near calls, including both register and Must combine with
_CALL memory indirect, executed. umask 80H.
88H 40H BR_INST_EXEC.NONTAKEN Qualify non-taken near branches executed. Applicable to umask 01H
only.
88H 80H BR_INST_EXEC.TAKEN Qualify taken near branches executed. Must combine
with 01H, 02H, 04H, 08H, 10H, 20H.
88H FFH BR_INST_EXEC.ALL_BRANCHES Counts all near executed branches (not necessarily
retired).
89H 01H BR_MISP_EXEC.COND Qualify conditional near branch instructions Must combine with
mispredicted. umask 40H, 80H.
89H 04H BR_MISP_EXEC.INDIRECT_JMP_ Qualify mispredicted indirect near branch Must combine with
NON_CALL_RET instructions that are not calls or returns. umask 80H.
89H 08H BR_MISP_EXEC.RETURN_NEAR Qualify mispredicted indirect near branches that Must combine with
have a return mnemonic. umask 80H.
89H 10H BR_MISP_EXEC.DIRECT_NEAR_C Qualify mispredicted unconditional near call branch Must combine with
ALL instructions, excluding non-call branch, executed. umask 80H.
89H 20H BR_MISP_EXEC.INDIRECT_NEAR Qualify mispredicted indirect near calls, including Must combine with
_CALL both register and memory indirect, executed. umask 80H.
89H 40H BR_MISP_EXEC.NONTAKEN Qualify mispredicted non-taken near branches Applicable to umask 01H
executed. only.
89H 80H BR_MISP_EXEC.TAKEN Qualify mispredicted taken near branches executed.
Must combine with 01H, 02H, 04H, 08H, 10H, 20H.
89H FFH BR_MISP_EXEC.ALL_BRANCHES Counts all near executed branches (not necessarily
retired).
9CH 01H IDQ_UOPS_NOT_DELIVERED.CO Count issue pipeline slots where no uop was Use Cmask to qualify uop
RE delivered from the front end to the back end when b/w.
there is no back-end stall.
A1H 01H UOPS_EXECUTED_PORT.PORT_ Cycles which a uop is dispatched on port 0 in this Set AnyThread to count
0 thread. per core.
A1H 02H UOPS_EXECUTED_PORT.PORT_ Cycles which a uop is dispatched on port 1 in this Set AnyThread to count
1 thread. per core.
A1H 04H UOPS_EXECUTED_PORT.PORT_ Cycles which a uop is dispatched on port 2 in this Set AnyThread to count
2 thread. per core.
A1H 08H UOPS_EXECUTED_PORT.PORT_ Cycles which a uop is dispatched on port 3 in this Set AnyThread to count
3 thread. per core.
A1H 10H UOPS_EXECUTED_PORT.PORT_ Cycles which a uop is dispatched on port 4 in this Set AnyThread to count
4 thread. per core.
A1H 20H UOPS_EXECUTED_PORT.PORT_ Cycles which a uop is dispatched on port 5 in this Set AnyThread to count
5 thread. per core.
A1H 40H UOPS_EXECUTED_PORT.PORT_ Cycles which a uop is dispatched on port 6 in this Set AnyThread to count
6 thread. per core.
A1H 80H UOPS_EXECUTED_PORT.PORT_ Cycles which a uop is dispatched on port 7 in this Set AnyThread to count
7 thread. per core.
A2H 01H RESOURCE_STALLS.ANY Cycles allocation is stalled due to resource related
reason.
A2H 04H RESOURCE_STALLS.RS Cycles stalled due to no eligible RS entry available.
A2H 08H RESOURCE_STALLS.SB Cycles stalled due to no store buffers available (not
including draining from sync).
A2H 10H RESOURCE_STALLS.ROB Cycles stalled due to re-order buffer full.
A3H 01H CYCLE_ACTIVITY.CYCLES_L2_PE Cycles with pending L2 miss loads. Set Cmask=2 to Use only when HTT is off.
NDING count cycle.
A3H 02H CYCLE_ACTIVITY.CYCLES_LDM_ Cycles with pending memory loads. Set Cmask=2 to
PENDING count cycle.
A3H 05H CYCLE_ACTIVITY.STALLS_L2_PE Number of loads missed L2. Use only when HTT is off.
NDING
A3H 08H CYCLE_ACTIVITY.CYCLES_L1D_P Cycles with pending L1 data cache miss loads. Set PMC2 only.
ENDING Cmask=8 to count cycle.
A3H 0CH CYCLE_ACTIVITY.STALLS_L1D_P Execution stalls due to L1 data cache miss loads. Set PMC2 only.
ENDING Cmask=0CH.
A8H 01H LSD.UOPS Number of uops delivered by the LSD.
AEH 01H ITLB.ITLB_FLUSH Counts the number of ITLB flushes, includes
4k/2M/4M pages.
B0H 01H OFFCORE_REQUESTS.DEMAND_ Demand data read requests sent to uncore. Use only when HTT is off.
DATA_RD
B0H 02H OFFCORE_REQUESTS.DEMAND_ Demand code read requests sent to uncore. Use only when HTT is off.
CODE_RD
B0H 04H OFFCORE_REQUESTS.DEMAND_ Demand RFO read requests sent to uncore, including Use only when HTT is off.
RFO regular RFOs, locks, ItoM.
B0H 08H OFFCORE_REQUESTS.ALL_DATA Data read requests sent to uncore (demand and Use only when HTT is off.
_RD prefetch).
B1H 02H UOPS_EXECUTED.CORE Counts total number of uops to be executed per-core Do not need to set ANY.
each cycle.
B7H 01H OFF_CORE_RESPONSE_0 See Table 18-28 or Table 18-29. Requires MSR 01A6H.
BBH 01H OFF_CORE_RESPONSE_1 See Table 18-28 or Table 18-29. Requires MSR 01A7H.
BCH 11H PAGE_WALKER_LOADS.DTLB_L1 Number of DTLB page walker loads that hit in the
L1+FB.
BCH 21H PAGE_WALKER_LOADS.ITLB_L1 Number of ITLB page walker loads that hit in the
L1+FB.
BCH 12H PAGE_WALKER_LOADS.DTLB_L2 Number of DTLB page walker loads that hit in the L2.
BCH 22H PAGE_WALKER_LOADS.ITLB_L2 Number of ITLB page walker loads that hit in the L2.
BCH 14H PAGE_WALKER_LOADS.DTLB_L3 Number of DTLB page walker loads that hit in the L3.
BCH 24H PAGE_WALKER_LOADS.ITLB_L3 Number of ITLB page walker loads that hit in the L3.
BCH 18H PAGE_WALKER_LOADS.DTLB_M Number of DTLB page walker loads from memory.
EMORY
BCH 28H PAGE_WALKER_LOADS.ITLB_ME Number of ITLB page walker loads from memory.
MORY
BDH 01H TLB_FLUSH.DTLB_THREAD DTLB flush attempts of the thread-specific entries.
BDH 20H TLB_FLUSH.STLB_ANY Count number of STLB flush attempts.
C0H 00H INST_RETIRED.ANY_P Number of instructions at retirement. See Table 19-1.
C0H 01H INST_RETIRED.PREC_DIST Precise instruction retired event with HW to reduce PMC1 only.
effect of PEBS shadow in IP distribution.
C1H 08H OTHER_ASSISTS.AVX_TO_SSE Number of transitions from AVX-256 to legacy SSE
when penalty applicable.
C1H 10H OTHER_ASSISTS.SSE_TO_AVX Number of transitions from SSE to AVX-256 when
penalty applicable.
C1H 40H OTHER_ASSISTS.ANY_WB_ASSI Number of microcode assists invoked by HW upon
ST uop writeback.
C2H 01H UOPS_RETIRED.ALL Counts the number of micro-ops retired. Use Supports PEBS and
Cmask=1 and invert to count active cycles or stalled DataLA; use Any=1 for
cycles. core granular.
C2H 02H UOPS_RETIRED.RETIRE_SLOTS Counts the number of retirement slots used each Supports PEBS.
cycle.
C3H 02H MACHINE_CLEARS.MEMORY_OR Counts the number of machine clears due to memory
DERING order conflicts.
C3H 04H MACHINE_CLEARS.SMC Number of self-modifying-code machine clears
detected.
C3H 20H MACHINE_CLEARS.MASKMOV Counts the number of executed AVX masked load
operations that refer to an illegal address range with
the mask bits set to 0.
C4H 00H BR_INST_RETIRED.ALL_BRANC Branch instructions at retirement. See Table 19-1.
HES
C4H 01H BR_INST_RETIRED.CONDITIONA Counts the number of conditional branch instructions Supports PEBS.
L retired.
C4H 02H BR_INST_RETIRED.NEAR_CALL Direct and indirect near call instructions retired. Supports PEBS.
C4H 04H BR_INST_RETIRED.ALL_BRANC Counts the number of branch instructions retired. Supports PEBS.
HES
C4H 08H BR_INST_RETIRED.NEAR_RETU Counts the number of near return instructions Supports PEBS.
RN retired.
C4H 10H BR_INST_RETIRED.NOT_TAKEN Counts the number of not taken branch instructions
retired.
C4H 20H BR_INST_RETIRED.NEAR_TAKE Number of near taken branches retired. Supports PEBS.
N
C4H 40H BR_INST_RETIRED.FAR_BRANC Number of far branches retired.
H
C5H 00H BR_MISP_RETIRED.ALL_BRANC Mispredicted branch instructions at retirement. See Table 19-1.
HES
C5H 01H BR_MISP_RETIRED.CONDITIONA Mispredicted conditional branch instructions retired. Supports PEBS.
L
C5H 04H BR_MISP_RETIRED.ALL_BRANC Mispredicted macro branch instructions retired. Supports PEBS.
HES
C5H 20H BR_MISP_RETIRED.NEAR_TAKE Number of near branch instructions retired that
N were taken but mispredicted.
CAH 02H FP_ASSIST.X87_OUTPUT Number of X87 FP assists due to output values.
CAH 04H FP_ASSIST.X87_INPUT Number of X87 FP assists due to input values.
CAH 08H FP_ASSIST.SIMD_OUTPUT Number of SIMD FP assists due to output values.
CAH 10H FP_ASSIST.SIMD_INPUT Number of SIMD FP assists due to input values.
CAH 1EH FP_ASSIST.ANY Cycles with any input/output SSE* or FP assists.
CCH 20H ROB_MISC_EVENTS.LBR_INSER Count cases of saving new LBR records by hardware.
TS
CDH 01H MEM_TRANS_RETIRED.LOAD_L Randomly sampled loads whose latency is above a Specify threshold in MSR
ATENCY user defined threshold. A small fraction of the overall 3F6H.
loads are sampled due to randomization.
D0H 11H MEM_UOPS_RETIRED.STLB_MIS Retired load uops that miss the STLB. Supports PEBS and
S_LOADS DataLA.
D0H 12H MEM_UOPS_RETIRED.STLB_MIS Retired store uops that miss the STLB. Supports PEBS and
S_STORES DataLA.
D0H 21H MEM_UOPS_RETIRED.LOCK_LOA Retired load uops with locked access. Supports PEBS and
DS DataLA.
D0H 41H MEM_UOPS_RETIRED.SPLIT_LO Retired load uops that split across a cacheline Supports PEBS and
ADS boundary. DataLA.
D0H 42H MEM_UOPS_RETIRED.SPLIT_ST Retired store uops that split across a cacheline Supports PEBS and
ORES boundary. DataLA.
D0H 81H MEM_UOPS_RETIRED.ALL_LOAD All retired load uops. Supports PEBS and
S DataLA.
D0H 82H MEM_UOPS_RETIRED.ALL_STOR All retired store uops. Supports PEBS and
ES DataLA.
D1H 01H MEM_LOAD_UOPS_RETIRED.L1_ Retired load uops with L1 cache hits as data sources. Supports PEBS and
HIT DataLA.
D1H 02H MEM_LOAD_UOPS_RETIRED.L2_ Retired load uops with L2 cache hits as data sources. Supports PEBS and
HIT DataLA.
D1H 04H MEM_LOAD_UOPS_RETIRED.L3_ Retired load uops with L3 cache hits as data sources. Supports PEBS and
HIT DataLA.
D1H 08H MEM_LOAD_UOPS_RETIRED.L1_ Retired load uops missed L1 cache as data sources. Supports PEBS and
MISS DataLA.
D1H 10H MEM_LOAD_UOPS_RETIRED.L2_ Retired load uops missed L2. Unknown data source Supports PEBS and
MISS excluded. DataLA.
D1H 20H MEM_LOAD_UOPS_RETIRED.L3_ Retired load uops missed L3. Excludes unknown data Supports PEBS and
MISS source. DataLA.
D1H 40H MEM_LOAD_UOPS_RETIRED.HIT Retired load uops which data sources were load uops Supports PEBS and
_LFB missed L1 but hit FB due to preceding miss to the DataLA.
same cache line with data not ready.
D2H 01H MEM_LOAD_UOPS_L3_HIT_RETI Retired load uops which data sources were L3 hit Supports PEBS and
RED.XSNP_MISS and cross-core snoop missed in on-pkg core cache. DataLA.
D2H 02H MEM_LOAD_UOPS_L3_HIT_RETI Retired load uops which data sources were L3 and Supports PEBS and
RED.XSNP_HIT cross-core snoop hits in on-pkg core cache. DataLA.
D2H 04H MEM_LOAD_UOPS_L3_HIT_RETI Retired load uops which data sources were HitM Supports PEBS and
RED.XSNP_HITM responses from shared L3. DataLA.
D2H 08H MEM_LOAD_UOPS_L3_HIT_RETI Retired load uops which data sources were hits in L3 Supports PEBS and
RED.XSNP_NONE without snoops required. DataLA.
D3H 01H MEM_LOAD_UOPS_L3_MISS_RE Retired load uops which data sources missed L3 but Supports PEBS and
TIRED.LOCAL_DRAM serviced from local dram. DataLA.
E6H 1FH BACLEARS.ANY Number of front end re-steers due to BPU
misprediction.
F0H 01H L2_TRANS.DEMAND_DATA_RD Demand data read requests that access L2 cache.
F0H 02H L2_TRANS.RFO RFO requests that access L2 cache.
F0H 04H L2_TRANS.CODE_RD L2 cache accesses when fetching instructions.
F0H 08H L2_TRANS.ALL_PF Any MLC or L3 HW prefetch accessing L2, including
rejects.
F0H 10H L2_TRANS.L1D_WB L1D writebacks that access L2 cache.
F0H 20H L2_TRANS.L2_FILL L2 fill requests that access L2 cache.
F0H 40H L2_TRANS.L2_WB L2 writebacks that access L2 cache.
F0H 80H L2_TRANS.ALL_REQUESTS Transactions accessing L2 pipe.
F1H 01H L2_LINES_IN.I L2 cache lines in I state filling L2. Counting does not cover
rejects.
F1H 02H L2_LINES_IN.S L2 cache lines in S state filling L2. Counting does not cover
rejects.
F1H 04H L2_LINES_IN.E L2 cache lines in E state filling L2. Counting does not cover
rejects.
F1H 07H L2_LINES_IN.ALL L2 cache lines filling L2. Counting does not cover
rejects.
F2H 05H L2_LINES_OUT.DEMAND_CLEAN Clean L2 cache lines evicted by demand.
F2H 06H L2_LINES_OUT.DEMAND_DIRTY Dirty L2 cache lines evicted by demand.
Table 19-12. Intel TSX Performance Events in Processors Based on Haswell Microarchitecture
Event Umask
Num. Value Event Mask Mnemonic Description Comment
54H 01H TX_MEM.ABORT_CONFLICT Number of times a transactional abort was signaled due
to a data conflict on a transactionally accessed address.
54H 02H TX_MEM.ABORT_CAPACITY_W Number of times a transactional abort was signaled due
RITE to a data capacity limitation for transactional writes.
54H 04H TX_MEM.ABORT_HLE_STORE_ Number of times a HLE transactional region aborted due
TO_ELIDED_LOCK to a non XRELEASE prefixed instruction writing to an
elided lock in the elision buffer.
54H 08H TX_MEM.ABORT_HLE_ELISION Number of times an HLE transactional execution aborted
_BUFFER_NOT_EMPTY due to NoAllocatedElisionBuffer being non-zero.
54H 10H TX_MEM.ABORT_HLE_ELISION Number of times an HLE transactional execution aborted
_BUFFER_MISMATCH due to XRELEASE lock not satisfying the address and
value requirements in the elision buffer.
54H 20H TX_MEM.ABORT_HLE_ELISION Number of times an HLE transactional execution aborted
_BUFFER_UNSUPPORTED_ALI due to an unsupported read alignment from the elision
GNMENT buffer.
54H 40H TX_MEM.HLE_ELISION_BUFFE Number of times HLE lock could not be elided due to
R_FULL ElisionBufferAvailable being zero.
5DH 01H TX_EXEC.MISC1 Counts the number of times a class of instructions that
may cause a transactional abort was executed. Since this
is the count of execution, it may not always cause a
transactional abort.
5DH 02H TX_EXEC.MISC2 Counts the number of times a class of instructions (for
example, vzeroupper) that may cause a transactional
abort was executed inside a transactional region.
5DH 04H TX_EXEC.MISC3 Counts the number of times an instruction execution
caused the transactional nest count supported to be
exceeded.
5DH 08H TX_EXEC.MISC4 Counts the number of times an XBEGIN instruction was
executed inside an HLE transactional region.
5DH 10H TX_EXEC.MISC5 Counts the number of times an instruction with HLE-
XACQUIRE semantic was executed inside an RTM
transactional region.
C8H 01H HLE_RETIRED.START Number of times an HLE execution started. If HLE is supported.
C8H 02H HLE_RETIRED.COMMIT Number of times an HLE execution successfully
committed.
C8H 04H HLE_RETIRED.ABORTED Number of times an HLE execution aborted due to any
reasons (multiple categories may count as one). Supports
PEBS.
C8H 08H HLE_RETIRED.ABORTED_MEM Number of times an HLE execution aborted due to
various memory events (for example, read/write
capacity and conflicts).
C8H 10H HLE_RETIRED.ABORTED_TIME Number of times an HLE execution aborted due to
R uncommon conditions.
C8H 20H HLE_RETIRED.ABORTED_UNFR Number of times an HLE execution aborted due to HLE-
IENDLY unfriendly instructions.
C8H 40H HLE_RETIRED.ABORTED_MEM Number of times an HLE execution aborted due to
TYPE incompatible memory type.
C8H 80H HLE_RETIRED.ABORTED_EVEN Number of times an HLE execution aborted due to none
TS of the previous 4 categories (for example, interrupts).
C9H 01H RTM_RETIRED.START Number of times an RTM execution started. If RTM is supported.
C9H 02H RTM_RETIRED.COMMIT Number of times an RTM execution successfully
committed.
C9H 04H RTM_RETIRED.ABORTED Number of times an RTM execution aborted due to any
reasons (multiple categories may count as one). Supports
PEBS.
C9H 08H RTM_RETIRED.ABORTED_MEM Number of times an RTM execution aborted due to If RTM is supported.
various memory events (for example, read/write
capacity and conflicts).
C9H 10H RTM_RETIRED.ABORTED_TIME Number of times an RTM execution aborted due to
R uncommon conditions.
C9H 20H RTM_RETIRED.ABORTED_UNF Number of times an RTM execution aborted due to HLE-
RIENDLY unfriendly instructions.
C9H 40H RTM_RETIRED.ABORTED_MEM Number of times an RTM execution aborted due to
TYPE incompatible memory type.
C9H 80H RTM_RETIRED.ABORTED_EVE Number of times an RTM execution aborted due to none
NTS of the previous 4 categories (for example, interrupt).
Model-specific performance monitoring events located in the uncore sub-system are implementation-specific
across platforms using processors based on Haswell microarchitecture and with different
DisplayFamily_DisplayModel signatures. Processors with CPUID signature of DisplayFamily_DisplayModel 06_3CH
and 06_45H support the performance events listed in Table 19-13.
Table 19-13. Uncore Performance Events in the 4th Generation Intel® Core™ Processors
Event Umask
Num.1 Value Event Mask Mnemonic Description Comment
22H 01H UNC_CBO_XSNP_RESPONSE.M A snoop misses in some processor core. Must combine with
ISS one of the umask
22H 02H UNC_CBO_XSNP_RESPONSE.I A snoop invalidates a non-modified line in some values of 20H, 40H,
NVAL processor core. 80H.
34H 08H UNC_CBO_CACHE_LOOKUP.I L3 lookup requests that access the cache and find the
line in I-state.
34H 10H UNC_CBO_CACHE_LOOKUP.RE Filter on processor core initiated cacheable read
AD_FILTER requests. Must combine with at least one of 01H, 02H,
04H, 08H.
34H 20H UNC_CBO_CACHE_LOOKUP.WR Filter on processor core initiated cacheable write
ITE_FILTER requests. Must combine with at least one of 01H, 02H,
04H, 08H.
34H 40H UNC_CBO_CACHE_LOOKUP.EX Filter on external snoop requests. Must combine with
TSNP_FILTER at least one of 01H, 02H, 04H, 08H.
34H 80H UNC_CBO_CACHE_LOOKUP.AN Filter on any IRQ or IPQ initiated requests including
Y_REQUEST_FILTER uncacheable, non-coherent requests. Must combine
with at least one of 01H, 02H, 04H, 08H.
80H 01H UNC_ARB_TRK_OCCUPANCY.A Counts cycles weighted by the number of requests Counter 0 only.
LL waiting for data returning from the memory controller.
Accounts for coherent and non-coherent requests
initiated by IA cores, processor graphic units, or L3.
81H 01H UNC_ARB_TRK_REQUEST.ALL Counts the number of coherent and non-coherent
requests initiated by IA cores, processor graphic units,
or L3.
81H 20H UNC_ARB_TRK_REQUEST.WRI Counts the number of allocated write entries, include
TES full, partial, and L3 evictions.
81H 80H UNC_ARB_TRK_REQUEST.EVIC Counts the number of L3 evictions allocated.
TIONS
83H 01H UNC_ARB_COH_TRK_OCCUPA Cycles weighted by number of requests pending in Counter 0 only.
NCY.ALL Coherency Tracker.
84H 01H UNC_ARB_COH_TRK_REQUES Number of requests allocated in Coherency Tracker.
T.ALL
NOTES:
1. The uncore events must be programmed using MSRs located in specific performance monitoring units in the uncore. UNC_CBO*
events are supported using MSR_UNC_CBO* MSRs; UNC_ARB* events are supported using MSR_UNC_ARB*MSRs.
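As the note above indicates, these uncore events are programmed through model-specific control MSRs rather than the architectural IA32_PERFEVTSELx registers. As a rough sketch only, assuming the uncore control registers follow the common event-select layout (event select in bits 7:0, unit mask in bits 15:8, enable in bit 22), a control value for a filtered UNC_CBO_CACHE_LOOKUP could be composed as below. The field layout is an assumption for illustration; consult the model-specific MSR listings for the actual register definitions and addresses.

```python
# Hypothetical sketch: compose an uncore event-select control value.
# ASSUMPTION: a PERFEVTSEL-style layout (event select bits 7:0, unit
# mask bits 15:8, enable bit 22). Verify against the MSR listings for
# the target DisplayFamily_DisplayModel before use.

def uncore_ctrl(event: int, umask: int, enable: bool = True) -> int:
    """Pack event select and unit mask into a control-register value."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8)
    if enable:
        value |= 1 << 22  # enable bit
    return value

# UNC_CBO_CACHE_LOOKUP (event 34H): I-state lines (08H) + read filter (10H).
ctrl = uncore_ctrl(0x34, 0x08 | 0x10)
print(hex(ctrl))  # prints 0x401834
```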
19.7.1 Performance Monitoring Events in the Processor Core of Intel Xeon Processor E5 v3
Family
Model-specific performance monitoring events in the processor core that are applicable only to the Intel Xeon processor E5 v3 family based on the Haswell-E microarchitecture, with CPUID signature of DisplayFamily_DisplayModel 06_3FH, are listed in Table 19-14. The performance events listed in Table 19-11 and Table 19-12 also apply to the Intel Xeon processor E5 v3 family, except that the OFF_CORE_RESPONSE_x event listed in Table 19-11 should reference Table 18-30.
Uncore performance monitoring events for Intel Xeon Processor E5 v3 families are described in “Intel® Xeon®
Processor E5 v3 Uncore Performance Monitoring Programming Reference Manual”.
Table 19-14. Performance Events Applicable only to the Processor Core of Intel® Xeon® Processor E5 v3 Family
Event Num.  Umask Value  Event Mask Mnemonic  Description  Comment
D3H  04H  MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM: Retired load uops whose data sources were remote DRAM (snoop not needed, Snoop Miss). (Supports PEBS.)
D3H  10H  MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM: Retired load uops whose data sources were remote cache HITM. (Supports PEBS.)
D3H  20H  MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD: Retired load uops whose data sources were forwards from a remote cache. (Supports PEBS.)
Table 19-15. Performance Events In the Processor Core of 3rd Generation Intel® Core™ i7, i5, i3 Processors
Event Num.  Umask Value  Event Mask Mnemonic  Description  Comment
03H  02H  LD_BLOCKS.STORE_FORWARD: Loads blocked by overlapping with a store buffer entry that cannot be forwarded.
03H  08H  LD_BLOCKS.NO_SR: The number of times that split load operations are temporarily blocked because all resources for handling the split accesses are in use.
05H  01H  MISALIGN_MEM_REF.LOADS: Speculative cache-line split load uops dispatched to L1D.
05H  02H  MISALIGN_MEM_REF.STORES: Speculative cache-line split store-address uops dispatched to L1D.
07H  01H  LD_BLOCKS_PARTIAL.ADDRESS_ALIAS: False dependencies in MOB due to partial compare on address.
08H  81H  DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK: Misses in all TLB levels that cause a page walk of any page size from demand loads.
08H  82H  DTLB_LOAD_MISSES.WALK_COMPLETED: Misses in all TLB levels that caused a completed page walk of any size by demand loads.
08H  84H  DTLB_LOAD_MISSES.WALK_DURATION: Cycles PMH is busy with a walk due to demand loads.
08H  88H  DTLB_LOAD_MISSES.LARGE_PAGE_WALK_DURATION: Page walk for a large page completed for demand load.
0EH  01H  UOPS_ISSUED.ANY: Increments each cycle the # of uops issued by the RAT to the RS. Set Cmask = 1, Inv = 1, Any = 1 to count stalled cycles of this core. (Set Cmask = 1, Inv = 1 to count stalled cycles.)
0EH  10H  UOPS_ISSUED.FLAGS_MERGE: Number of flags-merge uops allocated. Such uops add delay.
0EH  20H  UOPS_ISSUED.SLOW_LEA: Number of slow LEA or similar uops allocated. Such a uop has 3 sources (e.g., 2 sources + immediate), regardless of whether it results from an LEA instruction.
0EH  40H  UOPS_ISSUED.SINGLE_MUL: Number of multiply packed/scalar single precision uops allocated.
10H  01H  FP_COMP_OPS_EXE.X87: Counts number of X87 uops executed.
10H  10H  FP_COMP_OPS_EXE.SSE_FP_PACKED_DOUBLE: Counts number of SSE* or AVX-128 double precision FP packed uops executed.
10H  20H  FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE: Counts number of SSE* or AVX-128 single precision FP scalar uops executed.
10H  40H  FP_COMP_OPS_EXE.SSE_PACKED_SINGLE: Counts number of SSE* or AVX-128 single precision FP packed uops executed.
10H  80H  FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE: Counts number of SSE* or AVX-128 double precision FP scalar uops executed.
11H  01H  SIMD_FP_256.PACKED_SINGLE: Counts 256-bit packed single-precision floating-point instructions.
11H  02H  SIMD_FP_256.PACKED_DOUBLE: Counts 256-bit packed double-precision floating-point instructions.
14H  01H  ARITH.FPU_DIV_ACTIVE: Cycles that the divider is active; includes INT and FP. Set 'edge = 1, cmask = 1' to count the number of divides.
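Several rows in this table qualify their counts with Cmask, Inv, Edge, and AnyThread modifiers; ARITH.FPU_DIV_ACTIVE, for example, counts divider-active cycles by default, and 'edge = 1, cmask = 1' converts it into a count of individual divides. These modifiers are fields of the IA32_PERFEVTSELx registers. A minimal sketch of the architectural encoding (event select bits 7:0, umask bits 15:8, USR bit 16, OS bit 17, Edge bit 18, EN bit 22, INV bit 23, CMASK bits 31:24) follows:

```python
# Encode an IA32_PERFEVTSELx value using the architectural field layout.

def perfevtsel(event, umask, usr=True, os=True, edge=False,
               inv=False, cmask=0, en=True):
    value = (event & 0xFF) | ((umask & 0xFF) << 8) | ((cmask & 0xFF) << 24)
    value |= (usr << 16) | (os << 17) | (edge << 18) | (en << 22) | (inv << 23)
    return value

# ARITH.FPU_DIV_ACTIVE (event 14H, umask 01H) counting divides rather
# than active cycles: edge = 1, cmask = 1 per the table row above.
div_count = perfevtsel(0x14, 0x01, edge=True, cmask=1)
print(hex(div_count))  # prints 0x1470114
```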
24H  01H  L2_RQSTS.DEMAND_DATA_RD_HIT: Demand data read requests that hit L2 cache.
24H  03H  L2_RQSTS.ALL_DEMAND_DATA_RD: Counts any demand and L1 HW prefetch data load requests to L2.
24H  04H  L2_RQSTS.RFO_HITS: Counts the number of store RFO requests that hit the L2 cache.
24H  08H  L2_RQSTS.RFO_MISS: Counts the number of store RFO requests that miss the L2 cache.
24H  0CH  L2_RQSTS.ALL_RFO: Counts all L2 store RFO requests.
24H  10H  L2_RQSTS.CODE_RD_HIT: Number of instruction fetches that hit the L2 cache.
24H  20H  L2_RQSTS.CODE_RD_MISS: Number of instruction fetches that missed the L2 cache.
24H  30H  L2_RQSTS.ALL_CODE_RD: Counts all L2 code requests.
24H  40H  L2_RQSTS.PF_HIT: Counts all L2 HW prefetcher requests that hit L2.
24H  80H  L2_RQSTS.PF_MISS: Counts all L2 HW prefetcher requests that missed L2.
24H  C0H  L2_RQSTS.ALL_PF: Counts all L2 HW prefetcher requests.
27H  01H  L2_STORE_LOCK_RQSTS.MISS: RFOs that miss cache lines.
27H  08H  L2_STORE_LOCK_RQSTS.HIT_M: RFOs that hit cache lines in M state.
27H  0FH  L2_STORE_LOCK_RQSTS.ALL: RFOs that access cache lines in any state.
28H  01H  L2_L1D_WB_RQSTS.MISS: Not rejected writebacks that missed LLC.
28H  04H  L2_L1D_WB_RQSTS.HIT_E: Not rejected writebacks from L1D to L2 cache lines in E state.
28H  08H  L2_L1D_WB_RQSTS.HIT_M: Not rejected writebacks from L1D to L2 cache lines in M state.
28H  0FH  L2_L1D_WB_RQSTS.ALL: Not rejected writebacks from L1D to L2 cache lines in any state.
2EH  4FH  LONGEST_LAT_CACHE.REFERENCE: This event counts requests originating from the core that reference a cache line in the last level cache. (See Table 19-1.)
2EH  41H  LONGEST_LAT_CACHE.MISS: This event counts each cache miss condition for references to the last level cache. (See Table 19-1.)
3CH  00H  CPU_CLK_UNHALTED.THREAD_P: Counts the number of thread cycles while the thread is not in a halt state. The thread enters the halt state when it is running the HLT instruction. The core frequency may change from time to time due to power or thermal throttling. (See Table 19-1.)
3CH  01H  CPU_CLK_THREAD_UNHALTED.REF_XCLK: Increments at the frequency of XCLK (100 MHz) when not halted. (See Table 19-1.)
48H  01H  L1D_PEND_MISS.PENDING: Increments the number of outstanding L1D misses every cycle. Set Cmask = 1 and Edge = 1 to count occurrences. (PMC2 only; set Cmask = 1 to count cycles.)
49H  01H  DTLB_STORE_MISSES.MISS_CAUSES_A_WALK: Miss in all TLB levels causes a page walk of any page size (4K/2M/4M/1G).
49H  02H  DTLB_STORE_MISSES.WALK_COMPLETED: Miss in all TLB levels causes a page walk that completes of any page size (4K/2M/4M/1G).
49H  04H  DTLB_STORE_MISSES.WALK_DURATION: Cycles PMH is busy with this walk.
49H  10H  DTLB_STORE_MISSES.STLB_HIT: Store operations that miss the first TLB level but hit the second and do not cause page walks.
4CH  01H  LOAD_HIT_PRE.SW_PF: Non-SW-prefetch load dispatches that hit the fill buffer allocated for S/W prefetch.
4CH  02H  LOAD_HIT_PRE.HW_PF: Non-SW-prefetch load dispatches that hit the fill buffer allocated for H/W prefetch.
51H  01H  L1D.REPLACEMENT: Counts the number of lines brought into the L1 data cache.
58H  04H  MOVE_ELIMINATION.INT_NOT_ELIMINATED: Number of integer move elimination candidate uops that were not eliminated.
58H  08H  MOVE_ELIMINATION.SIMD_NOT_ELIMINATED: Number of SIMD move elimination candidate uops that were not eliminated.
58H  01H  MOVE_ELIMINATION.INT_ELIMINATED: Number of integer move elimination candidate uops that were eliminated.
58H  02H  MOVE_ELIMINATION.SIMD_ELIMINATED: Number of SIMD move elimination candidate uops that were eliminated.
5CH  01H  CPL_CYCLES.RING0: Unhalted core cycles when the thread is in ring 0. (Use Edge to count transitions.)
5CH  02H  CPL_CYCLES.RING123: Unhalted core cycles when the thread is not in ring 0.
5EH  01H  RS_EVENTS.EMPTY_CYCLES: Cycles the RS is empty for the thread.
5FH  04H  DTLB_LOAD_MISSES.STLB_HIT: Counts load operations that missed the 1st level DTLB but hit the 2nd level.
60H  01H  OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD: Offcore outstanding demand data read transactions in SQ to uncore. Set Cmask = 1 to count cycles.
60H  02H  OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD: Offcore outstanding demand code read transactions in SQ to uncore. Set Cmask = 1 to count cycles.
60H  04H  OFFCORE_REQUESTS_OUTSTANDING.DEMAND_RFO: Offcore outstanding RFO store transactions in SQ to uncore. Set Cmask = 1 to count cycles.
60H  08H  OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD: Offcore outstanding cacheable data read transactions in SQ to uncore. Set Cmask = 1 to count cycles.
63H  01H  LOCK_CYCLES.SPLIT_LOCK_UC_LOCK_DURATION: Cycles in which the L1D and L2 are locked, due to a UC lock or split lock.
63H  02H  LOCK_CYCLES.CACHE_LOCK_DURATION: Cycles in which the L1D is locked.
79H  02H  IDQ.EMPTY: Counts cycles the IDQ is empty.
79H  04H  IDQ.MITE_UOPS: Increments each cycle the # of uops delivered to the IDQ from the MITE path. Set Cmask = 1 to count cycles. (Can combine Umask 04H and 20H.)
79H  08H  IDQ.DSB_UOPS: Increments each cycle the # of uops delivered to the IDQ from the DSB path. Set Cmask = 1 to count cycles. (Can combine Umask 08H and 10H.)
79H  10H  IDQ.MS_DSB_UOPS: Increments each cycle the # of uops delivered to the IDQ when MS is busy by DSB. Set Cmask = 1 to count cycles. Add Edge = 1 to count # of deliveries. (Can combine Umask 04H, 08H.)
79H  20H  IDQ.MS_MITE_UOPS: Increments each cycle the # of uops delivered to the IDQ when MS is busy by MITE. Set Cmask = 1 to count cycles. (Can combine Umask 04H, 08H.)
79H  30H  IDQ.MS_UOPS: Increments each cycle the # of uops delivered to the IDQ from the MS by either DSB or MITE. Set Cmask = 1 to count cycles. (Can combine Umask 04H, 08H.)
79H  18H  IDQ.ALL_DSB_CYCLES_ANY_UOPS: Counts cycles DSB delivers at least one uop. Set Cmask = 1.
79H  18H  IDQ.ALL_DSB_CYCLES_4_UOPS: Counts cycles DSB delivers four uops. Set Cmask = 4.
79H  24H  IDQ.ALL_MITE_CYCLES_ANY_UOPS: Counts cycles MITE delivers at least one uop. Set Cmask = 1.
79H  24H  IDQ.ALL_MITE_CYCLES_4_UOPS: Counts cycles MITE delivers four uops. Set Cmask = 4.
79H  3CH  IDQ.MITE_ALL_UOPS: # of uops delivered to the IDQ from any path.
80H  04H  ICACHE.IFETCH_STALL: Cycles where a code fetch stalled due to an L1 instruction cache miss or an iTLB miss.
80H  02H  ICACHE.MISSES: Number of instruction cache, streaming buffer, and victim cache misses. Includes UC accesses.
85H  01H  ITLB_MISSES.MISS_CAUSES_A_WALK: Misses in all ITLB levels that cause page walks.
85H  02H  ITLB_MISSES.WALK_COMPLETED: Misses in all ITLB levels that cause completed page walks.
85H  04H  ITLB_MISSES.WALK_DURATION: Cycles PMH is busy with a walk.
85H  10H  ITLB_MISSES.STLB_HIT: Number of cache load STLB hits. No page walk.
87H  01H  ILD_STALL.LCP: Stalls caused by changing prefix length of the instruction.
87H  04H  ILD_STALL.IQ_FULL: Stall cycles due to the IQ being full.
88H  01H  BR_INST_EXEC.COND: Qualify conditional near branch instructions executed, but not necessarily retired. (Must combine with umask 40H, 80H.)
88H  02H  BR_INST_EXEC.DIRECT_JMP: Qualify all unconditional near branch instructions excluding calls and indirect branches. (Must combine with umask 80H.)
88H  04H  BR_INST_EXEC.INDIRECT_JMP_NON_CALL_RET: Qualify executed indirect near branch instructions that are not calls or returns. (Must combine with umask 80H.)
88H  08H  BR_INST_EXEC.RETURN_NEAR: Qualify indirect near branches that have a return mnemonic. (Must combine with umask 80H.)
88H  10H  BR_INST_EXEC.DIRECT_NEAR_CALL: Qualify unconditional near call branch instructions, excluding non-call branches, executed. (Must combine with umask 80H.)
88H  20H  BR_INST_EXEC.INDIRECT_NEAR_CALL: Qualify indirect near calls, including both register and memory indirect, executed. (Must combine with umask 80H.)
88H  40H  BR_INST_EXEC.NONTAKEN: Qualify non-taken near branches executed. (Applicable to umask 01H only.)
88H  80H  BR_INST_EXEC.TAKEN: Qualify taken near branches executed. Must combine with 01H, 02H, 04H, 08H, 10H, 20H.
88H  FFH  BR_INST_EXEC.ALL_BRANCHES: Counts all near executed branches (not necessarily retired).
89H  01H  BR_MISP_EXEC.COND: Qualify conditional near branch instructions mispredicted. (Must combine with umask 40H, 80H.)
89H  04H  BR_MISP_EXEC.INDIRECT_JMP_NON_CALL_RET: Qualify mispredicted indirect near branch instructions that are not calls or returns. (Must combine with umask 80H.)
89H  08H  BR_MISP_EXEC.RETURN_NEAR: Qualify mispredicted indirect near branches that have a return mnemonic. (Must combine with umask 80H.)
89H  10H  BR_MISP_EXEC.DIRECT_NEAR_CALL: Qualify mispredicted unconditional near call branch instructions, excluding non-call branches, executed. (Must combine with umask 80H.)
89H  20H  BR_MISP_EXEC.INDIRECT_NEAR_CALL: Qualify mispredicted indirect near calls, including both register and memory indirect, executed. (Must combine with umask 80H.)
89H  40H  BR_MISP_EXEC.NONTAKEN: Qualify mispredicted non-taken near branches executed. (Applicable to umask 01H only.)
89H  80H  BR_MISP_EXEC.TAKEN: Qualify mispredicted taken near branches executed. Must combine with 01H, 02H, 04H, 08H, 10H, 20H.
89H  FFH  BR_MISP_EXEC.ALL_BRANCHES: Counts all near executed branches (not necessarily retired).
9CH  01H  IDQ_UOPS_NOT_DELIVERED.CORE: Counts issue pipeline slots where no uop was delivered from the front end to the back end when there is no back-end stall. (Use Cmask to qualify uop b/w.)
A1H  01H  UOPS_DISPATCHED_PORT.PORT_0: Cycles in which a uop is dispatched on port 0.
A1H  02H  UOPS_DISPATCHED_PORT.PORT_1: Cycles in which a uop is dispatched on port 1.
A1H  0CH  UOPS_DISPATCHED_PORT.PORT_2: Cycles in which a uop is dispatched on port 2.
A1H  30H  UOPS_DISPATCHED_PORT.PORT_3: Cycles in which a uop is dispatched on port 3.
A1H  40H  UOPS_DISPATCHED_PORT.PORT_4: Cycles in which a uop is dispatched on port 4.
A1H  80H  UOPS_DISPATCHED_PORT.PORT_5: Cycles in which a uop is dispatched on port 5.
A2H  01H  RESOURCE_STALLS.ANY: Cycles allocation is stalled due to a resource-related reason.
A2H  04H  RESOURCE_STALLS.RS: Cycles stalled due to no eligible RS entry available.
A2H  08H  RESOURCE_STALLS.SB: Cycles stalled due to no store buffers available (not including draining from sync).
A2H  10H  RESOURCE_STALLS.ROB: Cycles stalled due to re-order buffer full.
A3H  01H  CYCLE_ACTIVITY.CYCLES_L2_PENDING: Cycles with pending L2 miss loads. Set AnyThread to count per core.
A3H  02H  CYCLE_ACTIVITY.CYCLES_LDM_PENDING: Cycles with pending memory loads. Set AnyThread to count per core. (Restricted to counters 0-3 when HTT is disabled.)
A3H  04H  CYCLE_ACTIVITY.CYCLES_NO_EXECUTE: Cycles of dispatch stalls. Set AnyThread to count per core. (Restricted to counters 0-3 when HTT is disabled.)
A3H  05H  CYCLE_ACTIVITY.STALLS_L2_PENDING: Number of loads missed L2. (Restricted to counters 0-3 when HTT is disabled.)
A3H  06H  CYCLE_ACTIVITY.STALLS_LDM_PENDING: (Restricted to counters 0-3 when HTT is disabled.)
A3H  08H  CYCLE_ACTIVITY.CYCLES_L1D_PENDING: Cycles with pending L1 cache miss loads. Set AnyThread to count per core. (PMC2 only.)
A3H  0CH  CYCLE_ACTIVITY.STALLS_L1D_PENDING: Execution stalls due to L1 data cache miss loads. Set Cmask = 0CH. (PMC2 only.)
A8H  01H  LSD.UOPS: Number of uops delivered by the LSD.
ABH  01H  DSB2MITE_SWITCHES.COUNT: Number of DSB-to-MITE switches.
ABH  02H  DSB2MITE_SWITCHES.PENALTY_CYCLES: Cycles DSB-to-MITE switches caused delay.
ACH  08H  DSB_FILL.EXCEED_DSB_LINES: DSB fill encountered > 3 DSB lines.
AEH  01H  ITLB.ITLB_FLUSH: Counts the number of ITLB flushes; includes 4K/2M/4M pages.
B0H  01H  OFFCORE_REQUESTS.DEMAND_DATA_RD: Demand data read requests sent to uncore.
B0H  02H  OFFCORE_REQUESTS.DEMAND_CODE_RD: Demand code read requests sent to uncore.
B0H  04H  OFFCORE_REQUESTS.DEMAND_RFO: Demand RFO read requests sent to uncore, including regular RFOs, locks, ItoM.
B0H  08H  OFFCORE_REQUESTS.ALL_DATA_RD: Data read requests sent to uncore (demand and prefetch).
B1H  01H  UOPS_EXECUTED.THREAD: Counts total number of uops to be executed per thread each cycle. Set Cmask = 1, INV = 1 to count stall cycles.
B1H  02H  UOPS_EXECUTED.CORE: Counts total number of uops to be executed per core each cycle. (Do not need to set ANY.)
B7H  01H  OFFCORE_RESPONSE_0: See Section 18.3.4.5, “Off-core Response Performance Monitoring”. (Requires MSR 01A6H.)
BBH  01H  OFFCORE_RESPONSE_1: See Section 18.3.4.5, “Off-core Response Performance Monitoring”. (Requires MSR 01A7H.)
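The two OFFCORE_RESPONSE events are unusual in that, beyond the event/umask encoding, they need an auxiliary request/response mask programmed into MSR 01A6H or 01A7H. On Linux, the perf tool exposes that auxiliary MSR as the offcore_rsp modifier of a raw event. The sketch below builds such an event string; the 0x10001 mask is only a placeholder example, not a value taken from this table, and the valid bit definitions come from Section 18.3.4.5:

```python
# Sketch: build a Linux `perf stat` raw-event string for an off-core
# response event. The rsp_mask value is a hypothetical example; the
# request/response bit layout is defined in Section 18.3.4.5.

def offcore_event(event: int, umask: int, rsp_mask: int) -> str:
    return f"cpu/event={event:#x},umask={umask:#x},offcore_rsp={rsp_mask:#x}/"

# OFFCORE_RESPONSE_0 is event B7H, umask 01H; perf programs MSR 01A6H
# with the offcore_rsp value on behalf of the user.
print(offcore_event(0xB7, 0x01, 0x10001))
# prints cpu/event=0xb7,umask=0x1,offcore_rsp=0x10001/
```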
BDH  01H  TLB_FLUSH.DTLB_THREAD: DTLB flush attempts of the thread-specific entries.
BDH  20H  TLB_FLUSH.STLB_ANY: Count number of STLB flush attempts.
C0H  00H  INST_RETIRED.ANY_P: Number of instructions at retirement. (See Table 19-1.)
C0H  01H  INST_RETIRED.PREC_DIST: Precise instruction retired event with HW to reduce effect of PEBS shadow in IP distribution. (PMC1 only.)
C1H  08H  OTHER_ASSISTS.AVX_STORE: Number of assists associated with 256-bit AVX store operations.
C1H  10H  OTHER_ASSISTS.AVX_TO_SSE: Number of transitions from AVX-256 to legacy SSE when penalty applicable.
C1H  20H  OTHER_ASSISTS.SSE_TO_AVX: Number of transitions from SSE to AVX-256 when penalty applicable.
C1H  80H  OTHER_ASSISTS.WB: Number of times microcode assist is invoked by hardware upon uop writeback.
C2H  01H  UOPS_RETIRED.ALL: Counts the number of micro-ops retired. Use Cmask = 1 and invert to count active cycles or stalled cycles. (Supports PEBS; use Any = 1 for core granularity.)
C2H  02H  UOPS_RETIRED.RETIRE_SLOTS: Counts the number of retirement slots used each cycle. (Supports PEBS.)
C3H  02H  MACHINE_CLEARS.MEMORY_ORDERING: Counts the number of machine clears due to memory order conflicts.
C3H  04H  MACHINE_CLEARS.SMC: Number of self-modifying-code machine clears detected.
C3H  20H  MACHINE_CLEARS.MASKMOV: Counts the number of executed AVX masked load operations that refer to an illegal address range with the mask bits set to 0.
C4H  00H  BR_INST_RETIRED.ALL_BRANCHES: Branch instructions at retirement. (See Table 19-1.)
C4H  01H  BR_INST_RETIRED.CONDITIONAL: Counts the number of conditional branch instructions retired. (Supports PEBS.)
C4H  02H  BR_INST_RETIRED.NEAR_CALL: Direct and indirect near call instructions retired. (Supports PEBS.)
C4H  04H  BR_INST_RETIRED.ALL_BRANCHES: Counts the number of branch instructions retired. (Supports PEBS.)
C4H  08H  BR_INST_RETIRED.NEAR_RETURN: Counts the number of near return instructions retired. (Supports PEBS.)
C4H  10H  BR_INST_RETIRED.NOT_TAKEN: Counts the number of not taken branch instructions retired. (Supports PEBS.)
C4H  20H  BR_INST_RETIRED.NEAR_TAKEN: Number of near taken branches retired. (Supports PEBS.)
C4H  40H  BR_INST_RETIRED.FAR_BRANCH: Number of far branches retired. (Supports PEBS.)
C5H  00H  BR_MISP_RETIRED.ALL_BRANCHES: Mispredicted branch instructions at retirement. (See Table 19-1.)
C5H  01H  BR_MISP_RETIRED.CONDITIONAL: Mispredicted conditional branch instructions retired. (Supports PEBS.)
C5H  04H  BR_MISP_RETIRED.ALL_BRANCHES: Mispredicted macro branch instructions retired. (Supports PEBS.)
C5H  20H  BR_MISP_RETIRED.NEAR_TAKEN: Mispredicted taken branch instructions retired. (Supports PEBS.)
CAH  02H  FP_ASSIST.X87_OUTPUT: Number of X87 FP assists due to output values. (Supports PEBS.)
CAH  04H  FP_ASSIST.X87_INPUT: Number of X87 FP assists due to input values. (Supports PEBS.)
CAH  08H  FP_ASSIST.SIMD_OUTPUT: Number of SIMD FP assists due to output values. (Supports PEBS.)
CAH  10H  FP_ASSIST.SIMD_INPUT: Number of SIMD FP assists due to input values.
CAH  1EH  FP_ASSIST.ANY: Cycles with any input/output SSE* or FP assists.
CCH  20H  ROB_MISC_EVENTS.LBR_INSERTS: Count cases of saving new LBR records by hardware.
CDH  01H  MEM_TRANS_RETIRED.LOAD_LATENCY: Randomly sampled loads whose latency is above a user-defined threshold. A small fraction of the overall loads are sampled due to randomization. (Specify threshold in MSR 3F6H. PMC3 only.)
CDH  02H  MEM_TRANS_RETIRED.PRECISE_STORE: Sample stores and collect precise store operation via PEBS record. (See Section 18.3.4.4.3. PMC3 only.)
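MEM_TRANS_RETIRED.LOAD_LATENCY takes its latency threshold from MSR 3F6H rather than from the event encoding itself. On Linux, the perf tool exposes this threshold as the ldlat modifier and programs the MSR on the user's behalf. A small sketch, with an arbitrary example threshold of 128 core cycles:

```python
# Sketch: build a `perf record` event string for latency-above-threshold
# load sampling. perf derives the MSR 3F6H value from ldlat; the
# threshold 128 is an arbitrary example, and the trailing 'pp' requests
# PEBS precise sampling.

def load_latency_event(threshold: int) -> str:
    # MEM_TRANS_RETIRED.LOAD_LATENCY is event CDH, umask 01H.
    return f"cpu/event=0xcd,umask=0x1,ldlat={threshold}/pp"

print(load_latency_event(128))
# prints cpu/event=0xcd,umask=0x1,ldlat=128/pp
```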
D0H  11H  MEM_UOPS_RETIRED.STLB_MISS_LOADS: Retired load uops that miss the STLB. (Supports PEBS.)
D0H  12H  MEM_UOPS_RETIRED.STLB_MISS_STORES: Retired store uops that miss the STLB. (Supports PEBS.)
D0H  21H  MEM_UOPS_RETIRED.LOCK_LOADS: Retired load uops with locked access. (Supports PEBS.)
D0H  41H  MEM_UOPS_RETIRED.SPLIT_LOADS: Retired load uops that split across a cacheline boundary. (Supports PEBS.)
D0H  42H  MEM_UOPS_RETIRED.SPLIT_STORES: Retired store uops that split across a cacheline boundary. (Supports PEBS.)
D0H  81H  MEM_UOPS_RETIRED.ALL_LOADS: All retired load uops. (Supports PEBS.)
D0H  82H  MEM_UOPS_RETIRED.ALL_STORES: All retired store uops. (Supports PEBS.)
D1H  01H  MEM_LOAD_UOPS_RETIRED.L1_HIT: Retired load uops with L1 cache hits as data sources. (Supports PEBS.)
D1H  02H  MEM_LOAD_UOPS_RETIRED.L2_HIT: Retired load uops with L2 cache hits as data sources. (Supports PEBS.)
D1H  04H  MEM_LOAD_UOPS_RETIRED.LLC_HIT: Retired load uops whose data source was an LLC hit with no snoop required. (Supports PEBS.)
D1H  08H  MEM_LOAD_UOPS_RETIRED.L1_MISS: Retired load uops whose data source followed an L1 miss. (Supports PEBS.)
D1H  10H  MEM_LOAD_UOPS_RETIRED.L2_MISS: Retired load uops that missed L2, excluding unknown sources. (Supports PEBS.)
D1H  20H  MEM_LOAD_UOPS_RETIRED.LLC_MISS: Retired load uops whose data source is an LLC miss. (Supports PEBS. Restricted to counters 0-3 when HTT is disabled.)
D1H  40H  MEM_LOAD_UOPS_RETIRED.HIT_LFB: Retired load uops whose data sources were load uops that missed L1 but hit the FB due to a preceding miss to the same cache line with data not ready. (Supports PEBS.)
D2H  01H  MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS: Retired load uops whose data source was an on-package core cache LLC hit and cross-core snoop missed. (Supports PEBS.)
D2H  02H  MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT: Retired load uops whose data source was an on-package LLC hit and cross-core snoop hits. (Supports PEBS.)
D2H  04H  MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM: Retired load uops whose data source was an on-package core cache with HitM responses. (Supports PEBS.)
D2H  08H  MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_NONE: Retired load uops whose data source was an LLC hit with no snoop required. (Supports PEBS.)
D3H  01H  MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM: Retired load uops whose data source was local memory (cross-socket snoop not needed or missed). (Supports PEBS.)
E6H  1FH  BACLEARS.ANY: Number of front end re-steers due to BPU misprediction.
F0H  01H  L2_TRANS.DEMAND_DATA_RD: Demand data read requests that access L2 cache.
F0H  02H  L2_TRANS.RFO: RFO requests that access L2 cache.
F0H  04H  L2_TRANS.CODE_RD: L2 cache accesses when fetching instructions.
F0H  08H  L2_TRANS.ALL_PF: Any MLC or LLC HW prefetch accessing L2, including rejects.
F0H  10H  L2_TRANS.L1D_WB: L1D writebacks that access L2 cache.
F0H  20H  L2_TRANS.L2_FILL: L2 fill requests that access L2 cache.
F0H  40H  L2_TRANS.L2_WB: L2 writebacks that access L2 cache.
F0H  80H  L2_TRANS.ALL_REQUESTS: Transactions accessing the L2 pipe.
F1H  01H  L2_LINES_IN.I: L2 cache lines in I state filling L2. (Counting does not cover rejects.)
F1H  02H  L2_LINES_IN.S: L2 cache lines in S state filling L2. (Counting does not cover rejects.)
F1H  04H  L2_LINES_IN.E: L2 cache lines in E state filling L2. (Counting does not cover rejects.)
F1H  07H  L2_LINES_IN.ALL: L2 cache lines filling L2. (Counting does not cover rejects.)
F2H  01H  L2_LINES_OUT.DEMAND_CLEAN: Clean L2 cache lines evicted by demand.
F2H  02H  L2_LINES_OUT.DEMAND_DIRTY: Dirty L2 cache lines evicted by demand.
F2H  04H  L2_LINES_OUT.PF_CLEAN: Clean L2 cache lines evicted by the MLC prefetcher.
F2H  08H  L2_LINES_OUT.PF_DIRTY: Dirty L2 cache lines evicted by the MLC prefetcher.
F2H  0AH  L2_LINES_OUT.DIRTY_ALL: Dirty L2 cache lines filling the L2. (Counting does not cover rejects.)
19.8.1 Performance Monitoring Events in the Processor Core of Intel Xeon Processor E5 v2
Family and Intel Xeon Processor E7 v2 Family
Model-specific performance monitoring events in the processor core that are applicable only to Intel Xeon
processor E5 v2 family and Intel Xeon processor E7 v2 family based on the Ivy Bridge-E microarchitecture, with
CPUID signature of DisplayFamily_DisplayModel 06_3EH, are listed in Table 19-16.
Table 19-17. Performance Events In the Processor Core Common to 2nd Generation Intel® Core™ i7-2xxx, Intel®
Core™ i5-2xxx, Intel® Core™ i3-2xxx Processor Series and Intel® Xeon® Processors E3 and E5 Family
Event Num.  Umask Value  Event Mask Mnemonic  Description  Comment
03H  01H  LD_BLOCKS.DATA_UNKNOWN: Blocked loads due to store buffer blocks with unknown data.
03H  02H  LD_BLOCKS.STORE_FORWARD: Loads blocked by overlapping with a store buffer entry that cannot be forwarded.
03H  08H  LD_BLOCKS.NO_SR: # of split loads blocked due to resource not available.
03H  10H  LD_BLOCKS.ALL_BLOCK: Number of cases where any load is blocked but has no DCU miss.
05H  01H  MISALIGN_MEM_REF.LOADS: Speculative cache-line split load uops dispatched to L1D.
05H  02H  MISALIGN_MEM_REF.STORES: Speculative cache-line split store-address uops dispatched to L1D.
07H  01H  LD_BLOCKS_PARTIAL.ADDRESS_ALIAS: False dependencies in MOB due to partial compare on address.
07H  08H  LD_BLOCKS_PARTIAL.ALL_STA_BLOCK: The number of times that load operations are temporarily blocked because of older stores with addresses that are not yet known. A load operation may incur more than one block of this type.
08H  01H  DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK: Misses in all TLB levels that cause a page walk of any page size.
08H  02H  DTLB_LOAD_MISSES.WALK_COMPLETED: Misses in all TLB levels that caused a completed page walk of any size.
08H  04H  DTLB_LOAD_MISSES.WALK_DURATION: Cycles PMH is busy with a walk.
08H  10H  DTLB_LOAD_MISSES.STLB_HIT: Number of cache load STLB hits. No page walk.
0DH  03H  INT_MISC.RECOVERY_CYCLES: Cycles waiting to recover after machine clears or JEClear. Set Cmask = 1. (Set Edge to count occurrences.)
0DH  40H  INT_MISC.RAT_STALL_CYCLES: Cycles RAT external stall is sent to IDQ for this thread.
0EH  01H  UOPS_ISSUED.ANY: Increments each cycle the # of uops issued by the RAT to the RS. Set Cmask = 1, Inv = 1, Any = 1 to count stalled cycles of this core. (Set Cmask = 1, Inv = 1 to count stalled cycles.)
10H  01H  FP_COMP_OPS_EXE.X87: Counts number of X87 uops executed.
10H  10H  FP_COMP_OPS_EXE.SSE_FP_PACKED_DOUBLE: Counts number of SSE* double precision FP packed uops executed.
10H  20H  FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE: Counts number of SSE* single precision FP scalar uops executed.
10H  40H  FP_COMP_OPS_EXE.SSE_PACKED_SINGLE: Counts number of SSE* single precision FP packed uops executed.
10H  80H  FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE: Counts number of SSE* double precision FP scalar uops executed.
11H  01H  SIMD_FP_256.PACKED_SINGLE: Counts 256-bit packed single-precision floating-point instructions.
11H  02H  SIMD_FP_256.PACKED_DOUBLE: Counts 256-bit packed double-precision floating-point instructions.
14H  01H  ARITH.FPU_DIV_ACTIVE: Cycles that the divider is active; includes INT and FP. Set 'edge = 1, cmask = 1' to count the number of divides.
17H  01H  INSTS_WRITTEN_TO_IQ.INSTS: Counts the number of instructions written into the IQ every cycle.
24H  01H  L2_RQSTS.DEMAND_DATA_RD_HIT: Demand data read requests that hit L2 cache.
24H  03H  L2_RQSTS.ALL_DEMAND_DATA_RD: Counts any demand and L1 HW prefetch data load requests to L2.
24H  04H  L2_RQSTS.RFO_HITS: Counts the number of store RFO requests that hit the L2 cache.
24H  08H  L2_RQSTS.RFO_MISS: Counts the number of store RFO requests that miss the L2 cache.
24H  0CH  L2_RQSTS.ALL_RFO: Counts all L2 store RFO requests.
24H  10H  L2_RQSTS.CODE_RD_HIT: Number of instruction fetches that hit the L2 cache.
24H  20H  L2_RQSTS.CODE_RD_MISS: Number of instruction fetches that missed the L2 cache.
24H  30H  L2_RQSTS.ALL_CODE_RD: Counts all L2 code requests.
24H  40H  L2_RQSTS.PF_HIT: Requests from the L2 hardware prefetcher that hit L2.
24H  80H  L2_RQSTS.PF_MISS: Requests from the L2 hardware prefetcher that missed L2.
24H  C0H  L2_RQSTS.ALL_PF: Any requests from the L2 hardware prefetchers.
27H  01H  L2_STORE_LOCK_RQSTS.MISS: RFOs that miss cache lines.
27H  04H  L2_STORE_LOCK_RQSTS.HIT_E: RFOs that hit cache lines in E state.
27H  08H  L2_STORE_LOCK_RQSTS.HIT_M: RFOs that hit cache lines in M state.
27H  0FH  L2_STORE_LOCK_RQSTS.ALL: RFOs that access cache lines in any state.
28H  01H  L2_L1D_WB_RQSTS.MISS: Not rejected writebacks from L1D to L2 cache lines that missed L2.
28H  02H  L2_L1D_WB_RQSTS.HIT_S: Not rejected writebacks from L1D to L2 cache lines in S state.
28H  04H  L2_L1D_WB_RQSTS.HIT_E: Not rejected writebacks from L1D to L2 cache lines in E state.
28H  08H  L2_L1D_WB_RQSTS.HIT_M: Not rejected writebacks from L1D to L2 cache lines in M state.
28H  0FH  L2_L1D_WB_RQSTS.ALL: Not rejected writebacks from L1D to L2 cache.
2EH  4FH  LONGEST_LAT_CACHE.REFERENCE: This event counts requests originating from the core that reference a cache line in the last level cache. (See Table 19-1.)
2EH  41H  LONGEST_LAT_CACHE.MISS: This event counts each cache miss condition for references to the last level cache. (See Table 19-1.)
3CH  00H  CPU_CLK_UNHALTED.THREAD_P: Counts the number of thread cycles while the thread is not in a halt state. The thread enters the halt state when it is running the HLT instruction. The core frequency may change from time to time due to power or thermal throttling. (See Table 19-1.)
3CH  01H  CPU_CLK_THREAD_UNHALTED.REF_XCLK: Increments at the frequency of XCLK (100 MHz) when not halted. (See Table 19-1.)
48H  01H  L1D_PEND_MISS.PENDING: Increments the number of outstanding L1D misses every cycle. Set Cmask = 1 and Edge = 1 to count occurrences. (PMC2 only; set Cmask = 1 to count cycles.)
49H  01H  DTLB_STORE_MISSES.MISS_CAUSES_A_WALK: Miss in all TLB levels causes a page walk of any page size (4K/2M/4M/1G).
49H  02H  DTLB_STORE_MISSES.WALK_COMPLETED: Miss in all TLB levels causes a page walk that completes of any page size (4K/2M/4M/1G).
49H  04H  DTLB_STORE_MISSES.WALK_DURATION: Cycles PMH is busy with this walk.
49H  10H  DTLB_STORE_MISSES.STLB_HIT: Store operations that miss the first TLB level but hit the second and do not cause page walks.
4CH  01H  LOAD_HIT_PRE.SW_PF: Non-SW-prefetch load dispatches that hit the fill buffer allocated for S/W prefetch.
4CH  02H  LOAD_HIT_PRE.HW_PF: Non-SW-prefetch load dispatches that hit the fill buffer allocated for H/W prefetch.
4EH  02H  HW_PRE_REQ.DL1_MISS: Hardware prefetch requests that miss the L1D cache. A request is counted each time it accesses the cache and misses it, including if a block is applicable or if it hits the fill buffer, for example. (This accounts for both L1 streamer and IP-based (IPP) HW prefetchers.)
51H  01H  L1D.REPLACEMENT: Counts the number of lines brought into the L1 data cache.
51H  02H  L1D.ALLOCATED_IN_M: Counts the number of allocations of modified L1D cache lines.
51H  04H  L1D.EVICTION: Counts the number of modified lines evicted from the L1 data cache due to replacement.
51H 08H L1D.ALL_M_REPLACEMENT Cache lines in M state evicted out of L1D due to Snoop HitM or dirty line replacement.
59H 20H PARTIAL_RAT_STALLS.FLAGS_MERGE_UOP Increments by the number of flags-merge uops in flight each cycle. Set Cmask = 1 to count cycles.
59H 40H PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW Cycles with at least one slow LEA uop allocated.
59H 80H PARTIAL_RAT_STALLS.MUL_SINGLE_UOP Number of multiply packed/scalar single precision uops allocated.
5BH 0CH RESOURCE_STALLS2.ALL_FL_EMPTY Cycles stalled due to free list empty. PMC0-3 only, regardless of HTT.
5BH 0FH RESOURCE_STALLS2.ALL_PRF_CONTROL Cycles stalled due to control structures full for physical registers.
5BH 40H RESOURCE_STALLS2.BOB_FULL Cycles the Allocator is stalled due to the Branch Order Buffer.
5BH 4FH RESOURCE_STALLS2.OOO_RSRC Cycles stalled due to out-of-order resources full.
5CH 01H CPL_CYCLES.RING0 Unhalted core cycles when the thread is in ring 0. Use Edge to count transitions.
5CH 02H CPL_CYCLES.RING123 Unhalted core cycles when the thread is not in ring 0.
5EH 01H RS_EVENTS.EMPTY_CYCLES Cycles the RS is empty for the thread.
60H 01H OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD Offcore outstanding Demand Data Read transactions in the SQ to the uncore. Set Cmask = 1 to count cycles.
60H 04H OFFCORE_REQUESTS_OUTSTANDING.DEMAND_RFO Offcore outstanding RFO store transactions in the SQ to the uncore. Set Cmask = 1 to count cycles.
60H 08H OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD Offcore outstanding cacheable data read transactions in the SQ to the uncore. Set Cmask = 1 to count cycles.
63H 01H LOCK_CYCLES.SPLIT_LOCK_UC_LOCK_DURATION Cycles in which the L1D and L2 are locked due to a UC lock or split lock.
63H 02H LOCK_CYCLES.CACHE_LOCK_DURATION Cycles in which the L1D is locked.
79H 02H IDQ.EMPTY Counts cycles the IDQ is empty.
79H 04H IDQ.MITE_UOPS Increments each cycle by the number of uops delivered to the IDQ from the MITE path. Set Cmask = 1 to count cycles. Can combine Umask 04H and 20H.
79H 08H IDQ.DSB_UOPS Increments each cycle by the number of uops delivered to the IDQ from the DSB path. Set Cmask = 1 to count cycles. Can combine Umask 08H and 10H.
79H 10H IDQ.MS_DSB_UOPS Increments each cycle by the number of uops delivered to the IDQ by the DSB when the MS is busy. Set Cmask = 1 to count cycles the MS is busy. Set Cmask = 1 and Edge = 1 to count MS activations. Can combine Umask 08H and 10H.
79H 20H IDQ.MS_MITE_UOPS Increments each cycle by the number of uops delivered to the IDQ by MITE when the MS is busy. Set Cmask = 1 to count cycles. Can combine Umask 04H and 20H.
79H 30H IDQ.MS_UOPS Increments each cycle by the number of uops delivered to the IDQ from the MS by either DSB or MITE. Set Cmask = 1 to count cycles. Can combine Umask 04H, 08H and 30H.
80H 02H ICACHE.MISSES Number of Instruction Cache, Streaming Buffer and Victim Cache misses. Includes UC accesses.
85H 01H ITLB_MISSES.MISS_CAUSES_A_WALK Misses in all ITLB levels that cause page walks.
85H 02H ITLB_MISSES.WALK_COMPLETED Misses in all ITLB levels that cause completed page walks.
85H 04H ITLB_MISSES.WALK_DURATION Cycles the PMH is busy with a walk.
85H 10H ITLB_MISSES.STLB_HIT Number of cache load STLB hits. No page walk.
87H 01H ILD_STALL.LCP Stalls caused by changing prefix length of the instruction.
87H 04H ILD_STALL.IQ_FULL Stall cycles due to the IQ being full.
88H 41H BR_INST_EXEC.NONTAKEN_CONDITIONAL Not-taken macro conditional branches.
88H 81H BR_INST_EXEC.TAKEN_CONDITIONAL Taken speculative and retired conditional branches.
88H 82H BR_INST_EXEC.TAKEN_DIRECT_JUMP Taken speculative and retired conditional branches excluding calls and indirects.
88H 84H BR_INST_EXEC.TAKEN_INDIRECT_JUMP_NON_CALL_RET Taken speculative and retired indirect branches excluding calls and returns.
88H 88H BR_INST_EXEC.TAKEN_INDIRECT_NEAR_RETURN Taken speculative and retired indirect branches that are returns.
88H 90H BR_INST_EXEC.TAKEN_DIRECT_NEAR_CALL Taken speculative and retired direct near calls.
88H A0H BR_INST_EXEC.TAKEN_INDIRECT_NEAR_CALL Taken speculative and retired indirect near calls.
88H C1H BR_INST_EXEC.ALL_CONDITIONAL Speculative and retired conditional branches.
88H C2H BR_INST_EXEC.ALL_DIRECT_JUMP Speculative and retired conditional branches excluding calls and indirects.
88H C4H BR_INST_EXEC.ALL_INDIRECT_JUMP_NON_CALL_RET Speculative and retired indirect branches excluding calls and returns.
88H C8H BR_INST_EXEC.ALL_INDIRECT_NEAR_RETURN Speculative and retired indirect branches that are returns.
88H D0H BR_INST_EXEC.ALL_NEAR_CALL Speculative and retired direct near calls.
88H FFH BR_INST_EXEC.ALL_BRANCHES Speculative and retired branches.
89H 41H BR_MISP_EXEC.NONTAKEN_CONDITIONAL Not-taken mispredicted macro conditional branches.
89H 81H BR_MISP_EXEC.TAKEN_CONDITIONAL Taken speculative and retired mispredicted conditional branches.
89H 84H BR_MISP_EXEC.TAKEN_INDIRECT_JUMP_NON_CALL_RET Taken speculative and retired mispredicted indirect branches excluding calls and returns.
89H 88H BR_MISP_EXEC.TAKEN_RETURN_NEAR Taken speculative and retired mispredicted indirect branches that are returns.
89H 90H BR_MISP_EXEC.TAKEN_DIRECT_NEAR_CALL Taken speculative and retired mispredicted direct near calls.
89H A0H BR_MISP_EXEC.TAKEN_INDIRECT_NEAR_CALL Taken speculative and retired mispredicted indirect near calls.
89H C1H BR_MISP_EXEC.ALL_CONDITIONAL Speculative and retired mispredicted conditional branches.
89H C4H BR_MISP_EXEC.ALL_INDIRECT_JUMP_NON_CALL_RET Speculative and retired mispredicted indirect branches excluding calls and returns.
89H D0H BR_MISP_EXEC.ALL_NEAR_CALL Speculative and retired mispredicted direct near calls.
89H FFH BR_MISP_EXEC.ALL_BRANCHES Speculative and retired mispredicted branches.
9CH 01H IDQ_UOPS_NOT_DELIVERED.CORE Counts issue pipeline slots where no uop was delivered from the front end to the back end when there is no back-end stall. Use Cmask to qualify uop bandwidth.
A1H 01H UOPS_DISPATCHED_PORT.PORT_0 Cycles in which a uop is dispatched on port 0.
A1H 02H UOPS_DISPATCHED_PORT.PORT_1 Cycles in which a uop is dispatched on port 1.
A1H 0CH UOPS_DISPATCHED_PORT.PORT_2 Cycles in which a uop is dispatched on port 2.
A1H 30H UOPS_DISPATCHED_PORT.PORT_3 Cycles in which a uop is dispatched on port 3.
A1H 40H UOPS_DISPATCHED_PORT.PORT_4 Cycles in which a uop is dispatched on port 4.
A1H 80H UOPS_DISPATCHED_PORT.PORT_5 Cycles in which a uop is dispatched on port 5.
A2H 01H RESOURCE_STALLS.ANY Cycles allocation is stalled due to a resource-related reason.
A2H 02H RESOURCE_STALLS.LB Counts the cycles of stall due to lack of load buffers.
A2H 04H RESOURCE_STALLS.RS Cycles stalled due to no eligible RS entry available.
A2H 08H RESOURCE_STALLS.SB Cycles stalled due to no store buffers available (not including draining from sync).
A2H 10H RESOURCE_STALLS.ROB Cycles stalled due to re-order buffer full.
A2H 20H RESOURCE_STALLS.FCSW Cycles stalled due to writing the FPU control word.
A3H 01H CYCLE_ACTIVITY.CYCLES_L2_PENDING Cycles with pending L2 miss loads. Set AnyThread to count per core.
A3H 02H CYCLE_ACTIVITY.CYCLES_L1D_PENDING Cycles with pending L1 cache miss loads. Set AnyThread to count per core. PMC2 only.
A3H 04H CYCLE_ACTIVITY.CYCLES_NO_DISPATCH Cycles of dispatch stalls. Set AnyThread to count per core. PMC0-3 only.
A3H 05H CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING PMC0-3 only.
A3H 06H CYCLE_ACTIVITY.STALL_CYCLES_L1D_PENDING PMC2 only.
A8H 01H LSD.UOPS Number of uops delivered by the LSD.
ABH 01H DSB2MITE_SWITCHES.COUNT Number of DSB-to-MITE switches.
ABH 02H DSB2MITE_SWITCHES.PENALTY_CYCLES Cycles DSB-to-MITE switches caused delay.
ACH 02H DSB_FILL.OTHER_CANCEL Cases of cancelling a valid DSB fill for a reason other than exceeding the way limit.
ACH 08H DSB_FILL.EXCEED_DSB_LINES DSB fill encountered more than 3 DSB lines.
AEH 01H ITLB.ITLB_FLUSH Counts the number of ITLB flushes; includes 4K/2M/4M pages.
B0H 01H OFFCORE_REQUESTS.DEMAND_DATA_RD Demand data read requests sent to the uncore.
B0H 04H OFFCORE_REQUESTS.DEMAND_RFO Demand RFO read requests sent to the uncore, including regular RFOs, locks, ItoM.
B0H 08H OFFCORE_REQUESTS.ALL_DATA_RD Data read requests sent to the uncore (demand and prefetch).
B1H 01H UOPS_DISPATCHED.THREAD Counts the total number of uops to be dispatched per thread each cycle. Set Cmask = 1, INV = 1 to count stall cycles. PMC0-3 only, regardless of HTT.
B1H 02H UOPS_DISPATCHED.CORE Counts the total number of uops to be dispatched per core each cycle. Do not need to set ANY.
B2H 01H OFFCORE_REQUESTS_BUFFER.SQ_FULL Offcore requests buffer cannot take more entries for this thread core.
B6H 01H AGU_BYPASS_CANCEL.COUNT Counts executed load operations with all the following traits: 1. addressing of the form [base + offset]; 2. the offset is between 1 and 2047; 3. the address specified in the base register is in one page and the address [base + offset] is in another page.
B7H 01H OFF_CORE_RESPONSE_0 See Section 18.3.4.5, “Off-core Response Performance Monitoring”. Requires MSR 01A6H.
BBH 01H OFF_CORE_RESPONSE_1 See Section 18.3.4.5, “Off-core Response Performance Monitoring”. Requires MSR 01A7H.
BDH 01H TLB_FLUSH.DTLB_THREAD DTLB flush attempts of the thread-specific entries.
BDH 20H TLB_FLUSH.STLB_ANY Count number of STLB flush attempts.
BFH 05H L1D_BLOCKS.BANK_CONFLICT_CYCLES Cycles when dispatched loads are cancelled due to L1D bank conflicts with other load ports. Cmask = 1.
C0H 00H INST_RETIRED.ANY_P Number of instructions at retirement. See Table 19-1.
C0H 01H INST_RETIRED.PREC_DIST Precise instruction-retired event with HW to reduce the effect of the PEBS shadow in IP distribution. PMC1 only; must quiesce other PMCs.
C1H 02H OTHER_ASSISTS.ITLB_MISS_RETIRED Instructions that experienced an ITLB miss.
C1H 08H OTHER_ASSISTS.AVX_STORE Number of assists associated with 256-bit AVX store operations.
C1H 10H OTHER_ASSISTS.AVX_TO_SSE Number of transitions from AVX-256 to legacy SSE when a penalty is applicable.
C1H 20H OTHER_ASSISTS.SSE_TO_AVX Number of transitions from SSE to AVX-256 when a penalty is applicable.
C2H 01H UOPS_RETIRED.ALL Counts the number of micro-ops retired. Use Cmask = 1 and invert to count active cycles or stalled cycles. Supports PEBS.
C2H 02H UOPS_RETIRED.RETIRE_SLOTS Counts the number of retirement slots used each cycle. Supports PEBS.
C3H 02H MACHINE_CLEARS.MEMORY_ORDERING Counts the number of machine clears due to memory order conflicts.
C3H 04H MACHINE_CLEARS.SMC Counts the number of times that a program writes to a code section.
C3H 20H MACHINE_CLEARS.MASKMOV Counts the number of executed AVX masked load operations that refer to an illegal address range with the mask bits set to 0.
C4H 00H BR_INST_RETIRED.ALL_BRANCHES Branch instructions at retirement. See Table 19-1.
C4H 01H BR_INST_RETIRED.CONDITIONAL Counts the number of conditional branch instructions retired. Supports PEBS.
C4H 02H BR_INST_RETIRED.NEAR_CALL Direct and indirect near call instructions retired. Supports PEBS.
C4H 04H BR_INST_RETIRED.ALL_BRANCHES Counts the number of branch instructions retired. Supports PEBS.
C4H 08H BR_INST_RETIRED.NEAR_RETURN Counts the number of near return instructions retired. Supports PEBS.
C4H 10H BR_INST_RETIRED.NOT_TAKEN Counts the number of not-taken branch instructions retired.
C4H 20H BR_INST_RETIRED.NEAR_TAKEN Number of near taken branches retired. Supports PEBS.
C4H 40H BR_INST_RETIRED.FAR_BRANCH Number of far branches retired.
C5H 00H BR_MISP_RETIRED.ALL_BRANCHES Mispredicted branch instructions at retirement. See Table 19-1.
C5H 01H BR_MISP_RETIRED.CONDITIONAL Mispredicted conditional branch instructions retired. Supports PEBS.
C5H 02H BR_MISP_RETIRED.NEAR_CALL Direct and indirect mispredicted near call instructions retired. Supports PEBS.
C5H 04H BR_MISP_RETIRED.ALL_BRANCHES Mispredicted macro branch instructions retired. Supports PEBS.
C5H 10H BR_MISP_RETIRED.NOT_TAKEN Mispredicted not-taken branch instructions retired. Supports PEBS.
C5H 20H BR_MISP_RETIRED.TAKEN Mispredicted taken branch instructions retired. Supports PEBS.
CAH 02H FP_ASSIST.X87_OUTPUT Number of X87 assists due to output value.
CAH 04H FP_ASSIST.X87_INPUT Number of X87 assists due to input value.
CAH 08H FP_ASSIST.SIMD_OUTPUT Number of SIMD FP assists due to output values.
CAH 10H FP_ASSIST.SIMD_INPUT Number of SIMD FP assists due to input values.
CAH 1EH FP_ASSIST.ANY Cycles with any input/output SSE* or FP assists.
CCH 20H ROB_MISC_EVENTS.LBR_INSERTS Count cases of saving new LBR records by hardware.
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY Randomly sampled loads whose latency is above a user-defined threshold. A small fraction of the overall loads are sampled due to randomization. Specify the threshold in MSR 3F6H. PMC3 only.
CDH 02H MEM_TRANS_RETIRED.PRECISE_STORE Sample stores and collect precise store operations via the PEBS record. See Section 18.3.4.4.3. PMC3 only.
D0H 11H MEM_UOPS_RETIRED.STLB_MISS_LOADS Retired load uops that miss the STLB. Supports PEBS. PMC0-3 only, regardless of HTT.
D0H 12H MEM_UOPS_RETIRED.STLB_MISS_STORES Retired store uops that miss the STLB. Supports PEBS. PMC0-3 only, regardless of HTT.
D0H 21H MEM_UOPS_RETIRED.LOCK_LOADS Retired load uops with locked access. Supports PEBS. PMC0-3 only, regardless of HTT.
D0H 41H MEM_UOPS_RETIRED.SPLIT_LOADS Retired load uops that split across a cacheline boundary. Supports PEBS. PMC0-3 only, regardless of HTT.
D0H 42H MEM_UOPS_RETIRED.SPLIT_STORES Retired store uops that split across a cacheline boundary. Supports PEBS. PMC0-3 only, regardless of HTT.
D0H 81H MEM_UOPS_RETIRED.ALL_LOADS All retired load uops. Supports PEBS. PMC0-3 only, regardless of HTT.
D0H 82H MEM_UOPS_RETIRED.ALL_STORES All retired store uops. Supports PEBS. PMC0-3 only, regardless of HTT.
D1H 01H MEM_LOAD_UOPS_RETIRED.L1_HIT Retired load uops with L1 cache hits as data sources. Supports PEBS. PMC0-3 only, regardless of HTT.
D1H 02H MEM_LOAD_UOPS_RETIRED.L2_HIT Retired load uops with L2 cache hits as data sources. Supports PEBS.
D1H 04H MEM_LOAD_UOPS_RETIRED.LLC_HIT Retired load uops whose data sources were data hits in the LLC without snoops required. Supports PEBS.
D1H 20H MEM_LOAD_UOPS_RETIRED.LLC_MISS Retired load uops whose data sources missed the LLC (excluding unknown data source). Supports PEBS.
D1H 40H MEM_LOAD_UOPS_RETIRED.HIT_LFB Retired load uops that missed the L1 but hit the FB due to a preceding miss to the same cache line with the data not ready. Supports PEBS.
D2H 01H MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS Retired load uops whose data source was an on-package core cache LLC hit and the cross-core snoop missed. Supports PEBS.
D2H 02H MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT Retired load uops whose data source was an on-package LLC hit and the cross-core snoop hit. Supports PEBS.
D2H 04H MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM Retired load uops whose data source was an on-package core cache with HitM responses. Supports PEBS.
D2H 08H MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_NONE Retired load uops whose data source was an LLC hit with no snoop required. Supports PEBS.
E6H 01H BACLEARS.ANY Counts the number of times the front end is re-steered, mainly when the BPU cannot provide a correct prediction and this is corrected by other branch handling mechanisms at the front end.
F0H 01H L2_TRANS.DEMAND_DATA_RD Demand Data Read requests that access the L2 cache.
F0H 02H L2_TRANS.RFO RFO requests that access the L2 cache.
F0H 04H L2_TRANS.CODE_RD L2 cache accesses when fetching instructions.
F0H 08H L2_TRANS.ALL_PF L2 or LLC HW prefetches that access the L2 cache, including rejects.
F0H 10H L2_TRANS.L1D_WB L1D writebacks that access the L2 cache.
F0H 20H L2_TRANS.L2_FILL L2 fill requests that access the L2 cache.
F0H 40H L2_TRANS.L2_WB L2 writebacks that access the L2 cache.
F0H 80H L2_TRANS.ALL_REQUESTS Transactions accessing the L2 pipe.
F1H 01H L2_LINES_IN.I L2 cache lines in I state filling the L2. Counting does not cover rejects.
F1H 02H L2_LINES_IN.S L2 cache lines in S state filling the L2. Counting does not cover rejects.
F1H 04H L2_LINES_IN.E L2 cache lines in E state filling the L2. Counting does not cover rejects.
F1H 07H L2_LINES_IN.ALL L2 cache lines filling the L2. Counting does not cover rejects.
F2H 01H L2_LINES_OUT.DEMAND_CLEAN Clean L2 cache lines evicted by demand.
F2H 02H L2_LINES_OUT.DEMAND_DIRTY Dirty L2 cache lines evicted by demand.
F2H 04H L2_LINES_OUT.PF_CLEAN Clean L2 cache lines evicted by L2 prefetch.
F2H 08H L2_LINES_OUT.PF_DIRTY Dirty L2 cache lines evicted by L2 prefetch.
F2H 0AH L2_LINES_OUT.DIRTY_ALL Dirty L2 cache lines filling the L2. Counting does not cover rejects.
F4H 10H SQ_MISC.SPLIT_LOCK Split locks in the SQ.
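Several of the comments above (for example, "Set Cmask = 1 and Edge = 1 to count occurrences" for L1D_PEND_MISS.PENDING) refer to fields of the IA32_PERFEVTSELx MSRs that select and qualify these events. The following is a minimal sketch of composing such an encoding from the event number and umask columns of the table; the `perfevtsel` helper name is illustrative, and actually writing the MSR (e.g., via a hypothetical `wrmsr` on MSR 186H + counter index) requires ring-0 access and is not shown.

```python
# Sketch: composing an IA32_PERFEVTSELx value from a row of the event table.
# Field layout: event select bits 7:0, umask bits 15:8, USR bit 16, OS bit 17,
# Edge bit 18, AnyThread bit 21, EN bit 22, INV bit 23, Cmask bits 31:24.

def perfevtsel(event, umask, usr=True, os=False, edge=False,
               any_thread=False, inv=False, cmask=0, enable=True):
    """Return the IA32_PERFEVTSELx encoding for one table row."""
    val = event & 0xFF
    val |= (umask & 0xFF) << 8
    val |= int(usr) << 16
    val |= int(os) << 17
    val |= int(edge) << 18
    val |= int(any_thread) << 21
    val |= int(enable) << 22
    val |= int(inv) << 23
    val |= (cmask & 0xFF) << 24
    return val

# L1D_PEND_MISS.PENDING (48H/01H): Cmask = 1 and Edge = 1 count occurrences,
# per the comment column; the table notes this event is valid on PMC2 only.
occurrences = perfevtsel(0x48, 0x01, cmask=1, edge=True)  # -> 0x1450148
```

The same helper covers comments such as "Use Edge to count transitions" (CPL_CYCLES.RING0) or "Set Cmask = 1, INV = 1 to count stall cycles" (UOPS_DISPATCHED.THREAD).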
Non-architectural performance monitoring events in the processor core that are applicable only to Intel processors with CPUID signature of DisplayFamily_DisplayModel 06_2AH are listed in Table 19-18.
Table 19-18. Performance Events applicable only to the Processor core for 2nd Generation Intel® Core™ i7-2xxx,
Intel® Core™ i5-2xxx, Intel® Core™ i3-2xxx Processor Series
Event Num. | Umask Value | Event Mask Mnemonic | Description | Comment
D2H 01H MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS Retired load uops whose data sources were LLC hits and the cross-core snoop missed in an on-pkg core cache. Supports PEBS. PMC0-3 only, regardless of HTT.
D2H 02H MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT Retired load uops whose data sources were LLC and cross-core snoop hits in an on-pkg core cache. Supports PEBS.
D2H 04H MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM Retired load uops whose data sources were HitM responses from a shared LLC. Supports PEBS.
D2H 08H MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_NONE Retired load uops whose data sources were hits in the LLC without snoops required. Supports PEBS.
D4H 02H MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS Retired load uops with unknown information as to which data-source cache serviced the load. Supports PEBS. PMC0-3 only, regardless of HTT.
B7H/BBH 01H OFF_CORE_RESPONSE_N Sub-events of OFF_CORE_RESPONSE_N (suffix N = 0, 1) programmed using MSR 01A6H/01A7H with the values shown in the comment column.
OFFCORE_RESPONSE.ALL_CODE_RD.LLC_HIT_N 10003C0244H
OFFCORE_RESPONSE.ALL_CODE_RD.LLC_HIT.NO_SNOOP_NEEDED_N 1003C0244H
OFFCORE_RESPONSE.ALL_CODE_RD.LLC_HIT.SNOOP_MISS_N 2003C0244H
OFFCORE_RESPONSE.ALL_CODE_RD.LLC_HIT.MISS_DRAM_N 300400244H
OFFCORE_RESPONSE.ALL_DATA_RD.LLC_HIT.ANY_RESPONSE_N 3F803C0091H
OFFCORE_RESPONSE.ALL_DATA_RD.LLC_MISS.DRAM_N 300400091H
OFFCORE_RESPONSE.ALL_PF_CODE_RD.LLC_HIT.ANY_RESPONSE_N 3F803C0240H
OFFCORE_RESPONSE.ALL_PF_CODE_RD.LLC_HIT.HIT_OTHER_CORE_NO_FWD_N 4003C0240H
OFFCORE_RESPONSE.ALL_PF_CODE_RD.LLC_HIT.HITM_OTHER_CORE_N 10003C0240H
OFFCORE_RESPONSE.ALL_PF_CODE_RD.LLC_HIT.NO_SNOOP_NEEDED_N 1003C0240H
OFFCORE_RESPONSE.ALL_PF_CODE_RD.LLC_HIT.SNOOP_MISS_N 2003C0240H
OFFCORE_RESPONSE.ALL_PF_CODE_RD.LLC_MISS.DRAM_N 300400240H
OFFCORE_RESPONSE.ALL_PF_DATA_RD.LLC_MISS.DRAM_N 300400090H
OFFCORE_RESPONSE.ALL_PF_RFO.LLC_HIT.ANY_RESPONSE_N 3F803C0120H
OFFCORE_RESPONSE.ALL_PF_RFO.LLC_HIT.HIT_OTHER_CORE_NO_FWD_N 4003C0120H
OFFCORE_RESPONSE.ALL_PF_RFO.LLC_HIT.HITM_OTHER_CORE_N 10003C0120H
OFFCORE_RESPONSE.ALL_PF_RFO.LLC_HIT.NO_SNOOP_NEEDED_N 1003C0120H
OFFCORE_RESPONSE.ALL_PF_RFO.LLC_HIT.SNOOP_MISS_N 2003C0120H
OFFCORE_RESPONSE.ALL_PF_RFO.LLC_MISS.DRAM_N 300400120H
OFFCORE_RESPONSE.ALL_READS.LLC_MISS.DRAM_N 3004003F7H
OFFCORE_RESPONSE.ALL_RFO.LLC_HIT.ANY_RESPONSE_N 3F803C0122H
OFFCORE_RESPONSE.ALL_RFO.LLC_HIT.HIT_OTHER_CORE_NO_FWD_N 4003C0122H
OFFCORE_RESPONSE.ALL_RFO.LLC_HIT.HITM_OTHER_CORE_N 10003C0122H
OFFCORE_RESPONSE.ALL_RFO.LLC_HIT.NO_SNOOP_NEEDED_N 1003C0122H
OFFCORE_RESPONSE.ALL_RFO.LLC_HIT.SNOOP_MISS_N 2003C0122H
OFFCORE_RESPONSE.ALL_RFO.LLC_MISS.DRAM_N 300400122H
OFFCORE_RESPONSE.DEMAND_CODE_RD.LLC_HIT.HIT_OTHER_CORE_NO_FWD_N 4003C0004H
OFFCORE_RESPONSE.DEMAND_CODE_RD.LLC_HIT.HITM_OTHER_CORE_N 10003C0004H
OFFCORE_RESPONSE.DEMAND_CODE_RD.LLC_HIT.NO_SNOOP_NEEDED_N 1003C0004H
OFFCORE_RESPONSE.DEMAND_CODE_RD.LLC_HIT.SNOOP_MISS_N 2003C0004H
OFFCORE_RESPONSE.DEMAND_CODE_RD.LLC_MISS.DRAM_N 300400004H
OFFCORE_RESPONSE.DEMAND_DATA_RD.LLC_MISS.DRAM_N 300400001H
OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.ANY_RESPONSE_N 3F803C0002H
OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HIT_OTHER_CORE_NO_FWD_N 4003C0002H
OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE_N 10003C0002H
19-96 Vol. 3B
PERFORMANCE MONITORING EVENTS
Table 19-18. Performance Events applicable only to the Processor core for 2nd Generation Intel® Core™ i7-2xxx,
Intel® Core™ i5-2xxx, Intel® Core™ i3-2xxx Processor Series (Contd.)
Event Umask
Num. Value Event Mask Mnemonic Description Comment
OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.NO_SNOOP_NEEDED_N 1003C0002H
OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.SNOOP_MISS_N 2003C0002H
OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.DRAM_N 300400002H
OFFCORE_RESPONSE.OTHER.ANY_RESPONSE_N 18000H
OFFCORE_RESPONSE.PF_L2_CODE_RD.LLC_HIT.HIT_OTHER_CORE_NO_FWD_N 4003C0040H
OFFCORE_RESPONSE.PF_L2_CODE_RD.LLC_HIT.HITM_OTHER_CORE_N 10003C0040H
OFFCORE_RESPONSE.PF_L2_CODE_RD.LLC_HIT.NO_SNOOP_NEEDED_N 1003C0040H
OFFCORE_RESPONSE.PF_L2_CODE_RD.LLC_HIT.SNOOP_MISS_N 2003C0040H
OFFCORE_RESPONSE.PF_L2_CODE_RD.LLC_MISS.DRAM_N 300400040H
OFFCORE_RESPONSE.PF_L2_DATA_RD.LLC_MISS.DRAM_N 300400010H
OFFCORE_RESPONSE.PF_L2_RFO.LLC_HIT.ANY_RESPONSE_N 3F803C0020H
OFFCORE_RESPONSE.PF_L2_RFO.LLC_HIT.HIT_OTHER_CORE_NO_FWD_N 4003C0020H
OFFCORE_RESPONSE.PF_L2_RFO.LLC_HIT.HITM_OTHER_CORE_N 10003C0020H
OFFCORE_RESPONSE.PF_L2_RFO.LLC_HIT.NO_SNOOP_NEEDED_N 1003C0020H
OFFCORE_RESPONSE.PF_L2_RFO.LLC_HIT.SNOOP_MISS_N 2003C0020H
OFFCORE_RESPONSE.PF_L2_RFO.LLC_MISS.DRAM_N 300400020H
OFFCORE_RESPONSE.PF_LLC_CODE_RD.LLC_HIT.HIT_OTHER_CORE_NO_FWD_N 4003C0200H
OFFCORE_RESPONSE.PF_LLC_CODE_RD.LLC_HIT.HITM_OTHER_CORE_N 10003C0200H
OFFCORE_RESPONSE.PF_LLC_CODE_RD.LLC_HIT.NO_SNOOP_NEEDED_N 1003C0200H
OFFCORE_RESPONSE.PF_LLC_CODE_RD.LLC_HIT.SNOOP_MISS_N 2003C0200H
OFFCORE_RESPONSE.PF_LLC_CODE_RD.LLC_MISS.DRAM_N 300400200H
OFFCORE_RESPONSE.PF_LLC_DATA_RD.LLC_MISS.DRAM_N 300400080H
OFFCORE_RESPONSE.PF_LLC_RFO.LLC_HIT.ANY_RESPONSE_N 3F803C0100H
OFFCORE_RESPONSE.PF_LLC_RFO.LLC_HIT.HIT_OTHER_CORE_NO_FWD_N 4003C0100H
OFFCORE_RESPONSE.PF_LLC_RFO.LLC_HIT.HITM_OTHER_CORE_N 10003C0100H
OFFCORE_RESPONSE.PF_LLC_RFO.LLC_HIT.NO_SNOOP_NEEDED_N 1003C0100H
OFFCORE_RESPONSE.PF_LLC_RFO.LLC_HIT.SNOOP_MISS_N 2003C0100H
OFFCORE_RESPONSE.PF_LLC_RFO.LLC_MISS.DRAM_N 300400100H
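Each value in the list above is a single MSR 01A6H/01A7H encoding whose low 16 bits select the request types and whose upper bits select the supplier and snoop responses (see Section 18.3.4.5 for the authoritative sub-field definitions). As a minimal sketch, the split can be computed directly; `split_offcore` is an illustrative helper name, not an Intel-defined API.

```python
# Sketch: splitting an OFFCORE_RESPONSE MSR value into its request-type
# field (bits 15:0) and its response/supplier/snoop field (the upper bits).
# The exact meaning of individual response bits is defined in Section 18.3.4.5.

def split_offcore(value):
    request_type = value & 0xFFFF   # which request types to match
    response = value >> 16          # supplier and snoop response qualifiers
    return request_type, response

# OFFCORE_RESPONSE.ALL_DATA_RD.LLC_HIT.ANY_RESPONSE_N uses 3F803C0091H:
req, resp = split_offcore(0x3F803C0091)
# req  == 0x0091   (ALL_DATA_RD request-type bits)
# resp == 0x3F803C (LLC-hit supplier bits plus any-snoop bits)
```

Note how the sibling entries for one request type (e.g., ALL_RFO, request field 0122H) differ only in the upper response bits, which is visible immediately after the split.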
Non-architectural performance monitoring events in the processor core that are applicable only to the Intel Xeon processor E5 family (and the Intel Core i7-3930 processor), based on Intel microarchitecture code name Sandy Bridge with CPUID signature of DisplayFamily_DisplayModel 06_2DH, are listed in Table 19-19.
Model-specific performance monitoring events that are located in the uncore sub-system are implementation
specific between different platforms using processors based on Intel microarchitecture code name Sandy Bridge.
Processors with CPUID signature of DisplayFamily_DisplayModel 06_2AH support performance events listed in
Table 19-20.
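Since the uncore events are implementation specific, software must match the CPUID DisplayFamily_DisplayModel signature (here 06_2AH) before using a given table. A minimal sketch of deriving that signature from CPUID leaf 1 EAX follows; the sample EAX value 000206A7H is an assumed Sandy Bridge signature used purely for illustration.

```python
# Sketch: computing DisplayFamily_DisplayModel from CPUID.01H:EAX.
# Extended family is added when the base family is 0FH; the extended model
# is prepended when the base family is 06H or 0FH.

def display_family_model(eax):
    family = (eax >> 8) & 0xF
    model = (eax >> 4) & 0xF
    if family == 0xF:
        family += (eax >> 20) & 0xFF
    if family in (0x6, 0xF):
        model |= ((eax >> 16) & 0xF) << 4
    return family, model

fam, mod = display_family_model(0x000206A7)  # assumed sample EAX value
# fam == 0x06 and mod == 0x2A: the Table 19-20 uncore events apply.
```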
Table 19-20. Performance Events In the Processor Uncore for 2nd Generation
Intel® Core™ i7-2xxx, Intel® Core™ i5-2xxx, Intel® Core™ i3-2xxx Processor Series
Event Num.1 | Umask Value | Event Mask Mnemonic | Description | Comment
22H 01H UNC_CBO_XSNP_RESPONSE.MISS A snoop misses in some processor core. Must combine with one of the umask values 20H, 40H, 80H.
22H 02H UNC_CBO_XSNP_RESPONSE.INVAL A snoop invalidates a non-modified line in some processor core. Must combine with one of the umask values 20H, 40H, 80H.
34H 04H UNC_CBO_CACHE_LOOKUP.S LLC lookup request that accessed the cache and found the line in S-state.
34H 08H UNC_CBO_CACHE_LOOKUP.I LLC lookup request that accessed the cache and found the line in I-state.
34H 10H UNC_CBO_CACHE_LOOKUP.READ_FILTER Filter on processor-core-initiated cacheable read requests. Must combine with at least one of 01H, 02H, 04H, 08H.
34H 20H UNC_CBO_CACHE_LOOKUP.WRITE_FILTER Filter on processor-core-initiated cacheable write requests. Must combine with at least one of 01H, 02H, 04H, 08H.
34H 40H UNC_CBO_CACHE_LOOKUP.EXTSNP_FILTER Filter on external snoop requests. Must combine with at least one of 01H, 02H, 04H, 08H.
34H 80H UNC_CBO_CACHE_LOOKUP.ANY_REQUEST_FILTER Filter on any IRQ or IPQ initiated requests, including uncacheable, non-coherent requests. Must combine with at least one of 01H, 02H, 04H, 08H.
80H 01H UNC_ARB_TRK_OCCUPANCY.ALL Counts cycles weighted by the number of requests waiting for data returning from the memory controller. Accounts for coherent and non-coherent requests initiated by IA cores, processor graphics units, or LLC. Counter 0 only.
81H 01H UNC_ARB_TRK_REQUEST.ALL Counts the number of coherent and non-coherent requests initiated by IA cores, processor graphics units, or LLC.
81H 20H UNC_ARB_TRK_REQUEST.WRITES Counts the number of allocated write entries, including full, partial, and LLC evictions.
81H 80H UNC_ARB_TRK_REQUEST.EVICTIONS Counts the number of LLC evictions allocated.
83H 01H UNC_ARB_COH_TRK_OCCUPANCY.ALL Cycles weighted by the number of requests pending in the Coherency Tracker. Counter 0 only.
84H 01H UNC_ARB_COH_TRK_REQUEST.ALL Number of requests allocated in the Coherency Tracker.
NOTES:
1. The uncore events must be programmed using MSRs located in specific performance monitoring units in the uncore. UNC_CBO* events are supported using MSR_UNC_CBO* MSRs; UNC_ARB* events are supported using MSR_UNC_ARB* MSRs.
Model-specific performance monitoring events that are located in the uncore sub-system are implementation specific between different platforms using processors based on Intel microarchitecture code name Nehalem. Processors with CPUID signature of DisplayFamily_DisplayModel 06_1AH, 06_1EH, and 06_1FH support the performance events listed in Table 19-22.
Intel Xeon processors with CPUID signature of DisplayFamily_DisplayModel 06_2EH have a distinct uncore sub-system that is significantly different from the uncore found in processors with CPUID signatures 06_1AH, 06_1EH, and 06_1FH. Model-specific performance monitoring events for its uncore will be available in future documentation.
0FH 10H MEM_UNCORE_RETIRED.REMOTE_DRAM Load instructions retired from remote DRAM and remote home-remote cache HITM (Precise Event). Applicable to two sockets only.
0FH 20H MEM_UNCORE_RETIRED.OTHER_LLC_MISS Load instructions retired with an other-LLC miss (Precise Event). Applicable to two sockets only.
0FH 80H MEM_UNCORE_RETIRED.UNCACHEABLE Load instructions retired from I/O (Precise Event). Applicable to one and two sockets.
10H 01H FP_COMP_OPS_EXE.X87 Counts the number of FP computational uops executed: the number of FADD, FSUB, FCOM, FMULs, integer MULs and IMULs, FDIVs, FPREMs, FSQRTs, integer DIVs, and IDIVs. This event does not distinguish an FADD used in the middle of a transcendental flow from a separate FADD instruction.
10H 02H FP_COMP_OPS_EXE.MMX Counts number of MMX uops executed.
10H 04H FP_COMP_OPS_EXE.SSE_FP Counts number of SSE and SSE2 FP uops executed.
10H 08H FP_COMP_OPS_EXE.SSE2_INTEGER Counts number of SSE2 integer uops executed.
10H 10H FP_COMP_OPS_EXE.SSE_FP_PACKED Counts number of SSE FP packed uops executed.
10H 20H FP_COMP_OPS_EXE.SSE_FP_SCALAR Counts number of SSE FP scalar uops executed.
10H 40H FP_COMP_OPS_EXE.SSE_SINGLE_PRECISION Counts number of SSE* FP single precision uops executed.
10H 80H FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION Counts number of SSE* FP double precision uops executed.
12H 01H SIMD_INT_128.PACKED_MPY Counts number of 128-bit SIMD integer multiply operations.
12H 02H SIMD_INT_128.PACKED_SHIFT Counts number of 128-bit SIMD integer shift operations.
12H 04H SIMD_INT_128.PACK Counts number of 128-bit SIMD integer pack operations.
12H 08H SIMD_INT_128.UNPACK Counts number of 128-bit SIMD integer unpack operations.
12H 10H SIMD_INT_128.PACKED_LOGICAL Counts number of 128-bit SIMD integer logical operations.
12H 20H SIMD_INT_128.PACKED_ARITH Counts number of 128-bit SIMD integer arithmetic operations.
12H 40H SIMD_INT_128.SHUFFLE_MOVE Counts number of 128-bit SIMD integer shuffle and move operations.
27H 08H L2_WRITE.RFO.M_STATE: Counts number of L2 store RFO requests where the cache line to be loaded is in the M (modified) state. The L1D prefetcher does not issue an RFO prefetch. This is a demand RFO request.
27H 0EH L2_WRITE.RFO.HIT: Counts number of L2 store RFO requests where the cache line to be loaded is in either the S, E, or M state. The L1D prefetcher does not issue an RFO prefetch. This is a demand RFO request.
27H 0FH L2_WRITE.RFO.MESI: Counts all L2 store RFO requests. The L1D prefetcher does not issue an RFO prefetch. This is a demand RFO request.
27H 10H L2_WRITE.LOCK.I_STATE: Counts number of L2 demand lock RFO requests where the cache line to be loaded is in the I (invalid) state, i.e., a cache miss.
27H 20H L2_WRITE.LOCK.S_STATE: Counts number of L2 lock RFO requests where the cache line to be loaded is in the S (shared) state.
27H 40H L2_WRITE.LOCK.E_STATE: Counts number of L2 demand lock RFO requests where the cache line to be loaded is in the E (exclusive) state.
27H 80H L2_WRITE.LOCK.M_STATE: Counts number of L2 demand lock RFO requests where the cache line to be loaded is in the M (modified) state.
27H E0H L2_WRITE.LOCK.HIT: Counts number of L2 demand lock RFO requests where the cache line to be loaded is in either the S, E, or M state.
27H F0H L2_WRITE.LOCK.MESI: Counts all L2 demand lock RFO requests.
The uncore sub-system of processors with a CPUID signature of DisplayFamily_DisplayModel 06_25H, 06_2CH, or 06_1FH supports the model-specific performance monitoring events listed in Table 19-24.
Table 19-25. Performance Events for Processors Based on Enhanced Intel Core Microarchitecture
Event Num.  Umask Value  Event Mask Mnemonic  Description  Comment
C0H 08H INST_RETIRED.VM_HOST: Instructions retired while in VMX root operation.
D2H 10H RAT_STALLS.OTHER_SERIALIZATION_STALLS: This event counts the number of stalls due to other RAT resource serialization not counted by umask value 0FH.
Table 19-26. Fixed-Function Performance Counter and Pre-defined Performance Events (Contd.)
Fixed-Function Performance Counter Address  Event Mask Mnemonic  Description
This event is not affected by core frequency changes (e.g., P states) but counts at the same frequency as the time stamp counter. This event can approximate elapsed time while the core was not in halt state and not in a TM stop-clock state. This event has a constant ratio with the CPU_CLK_UNHALTED.BUS event.
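Because this fixed-function counter runs at the time stamp counter frequency regardless of P-state transitions, a raw count can be converted into approximate non-halted time. A small illustrative sketch (the TSC frequency is a hypothetical input here; this manual does not specify how it is obtained):

```python
def approx_unhalted_seconds(ref_cycles: int, tsc_hz: float) -> float:
    """Approximate time (s) the core spent outside halt/stop-clock states.

    ref_cycles: count read from the fixed-function reference counter.
    tsc_hz: time stamp counter frequency in Hz (platform-specific).
    """
    return ref_cycles / tsc_hz

# E.g., 2.66e9 reference cycles on a hypothetical 2.66 GHz TSC is
# roughly one second of non-halted execution.
busy = approx_unhalted_seconds(2_660_000_000, 2.66e9)
print(round(busy, 3))  # 1.0
```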
Table 19-27 lists general-purpose model-specific performance monitoring events supported in processors based on
Intel® Core™ microarchitecture. For convenience, Table 19-27 also includes architectural events and describes
minor model-specific behavior where applicable. Software must use a general-purpose performance counter to
count events listed in Table 19-27.
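Each row in Table 19-27 pairs an event number with a umask value; to count one of these events, software programs both into the selected IA32_PERFEVTSELx register. The sketch below encodes the architectural IA32_PERFEVTSELx fields (event select in bits 7:0, umask in bits 15:8, USR/OS/Edge flags, enable in bit 22, CMASK in bits 31:24); the helper name is illustrative:

```python
def perfevtsel(event: int, umask: int, *, usr: bool = True, os: bool = True,
               edge: bool = False, inv: bool = False, cmask: int = 0) -> int:
    """Build an IA32_PERFEVTSELx value with the enable bit (22) set."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8)
    value |= (usr << 16) | (os << 17) | (edge << 18)  # privilege/edge flags
    value |= 1 << 22                                  # EN: enable counting
    value |= inv << 23                                # INV: invert CMASK test
    value |= (cmask & 0xFF) << 24                     # counter mask
    return value

# CPU_CLK_UNHALTED.CORE_P: event 3CH, umask 00H, count in user and OS modes.
print(hex(perfevtsel(0x3C, 0x00)))  # 0x43003c
```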
Table 19-27. Performance Events in Processors Based on Intel® Core™ Microarchitecture (Contd.)
Event Num.  Umask Value  Event Name  Definition  Description and Comment
• The load is from bytes written by the preceding store, the store is misaligned, and the load is not aligned on the beginning of the store.
• The load is split over an eight-byte boundary (excluding 16-byte loads).
• The load and store have the same offset relative to the beginning of different 4-KByte pages. This case is also called 4-KByte aliasing.
In all these cases the load is blocked until after the blocking store retires and the stored data is committed to the cache hierarchy.
03H 10H LOAD_BLOCK.UNTIL_RETIRE: Loads blocked until retirement. This event indicates that load operations were blocked until retirement. The number of events is greater or equal to the number of load operations that were blocked. This includes mainly uncacheable loads and split loads (loads that cross the cache line boundary) but may include other cases where loads are blocked until retirement.
03H 20H LOAD_BLOCK.L1D: Loads blocked by the L1 data cache. This event indicates that loads are blocked due to one or more reasons. Some triggers for this event are:
• The number of L1 data cache misses exceeds the maximum number of outstanding misses supported by the processor. This includes misses generated as a result of demand fetches, software prefetches, or hardware prefetches.
• Cache line split loads.
• Partial reads, such as reads to uncacheable memory, I/O instructions, and more.
• A locked load operation is in progress.
The number of events is greater or equal to the number of load operations that were blocked.
04H 01H SB_DRAIN_CYCLES: Cycles while stores are blocked due to store buffer drain. This event counts every cycle during which the store buffer is draining. This includes:
• Serializing operations such as CPUID.
• Synchronizing operations such as XCHG.
• Interrupt acknowledgment.
• Other conditions, such as cache flushing.
04H 02H STORE_BLOCK.ORDER: Cycles while store is waiting for a preceding store to be globally observed. This event counts the total duration, in number of cycles, during which stores are waiting for a preceding stored cache line to be observed by other cores. This situation happens as a result of the strong store ordering behavior, as defined in “Memory Ordering,” Chapter 8, Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
06H 00H SEGMENT_REG_LOADS: Number of segment register loads. This event counts the number of segment register load operations. Instructions that load new values into segment registers cause a penalty. This event indicates performance issues in 16-bit code. If this event occurs frequently, it may be useful to calculate the number of instructions retired per segment register load. If the resulting calculation is low (on average a small number of instructions are executed between segment register loads), then the code’s segment register usage should be optimized. As a result of branch misprediction, this event is speculative and may include segment register loads that do not actually occur. However, most segment register loads are internally serialized and such speculative effects are minimized.
07H 00H SSE_PRE_EXEC.NTA: Streaming SIMD Extensions (SSE) Prefetch NTA instructions executed. This event counts the number of times the SSE instruction prefetchNTA is executed. This instruction prefetches the data to the L1 data cache.
07H 01H SSE_PRE_EXEC.L1: Streaming SIMD Extensions (SSE) PrefetchT0 instructions executed. This event counts the number of times the SSE instruction prefetchT0 is executed. This instruction prefetches the data to the L1 data cache and L2 cache.
07H 02H SSE_PRE_EXEC.L2: Streaming SIMD Extensions (SSE) PrefetchT1 and PrefetchT2 instructions executed. This event counts the number of times the SSE instructions prefetchT1 and prefetchT2 are executed. These instructions prefetch the data to the L2 cache.
07H 03H SSE_PRE_EXEC.STORES: Streaming SIMD Extensions (SSE) weakly-ordered store instructions executed. This event counts the number of times SSE non-temporal store instructions are executed.
08H 01H DTLB_MISSES.ANY: Memory accesses that missed the DTLB. This event counts the number of Data Translation Lookaside Buffer (DTLB) misses. The count includes misses detected as a result of speculative accesses. Typically a high count for this event indicates that the code accesses a large number of data pages.
08H 02H DTLB_MISSES.MISS_LD: DTLB misses due to load operations. This event counts the number of Data Translation Lookaside Buffer (DTLB) misses due to load operations. This count includes misses detected as a result of speculative accesses.
08H 04H DTLB_MISSES.L0_MISS_LD: L0 DTLB misses due to load operations. This event counts the number of level 0 Data Translation Lookaside Buffer (DTLB0) misses due to load operations. This count includes misses detected as a result of speculative accesses. Loads that miss the DTLB0 and hit the DTLB1 can incur a two-cycle penalty.
08H 08H DTLB_MISSES.MISS_ST: DTLB misses due to store operations. This event counts the number of Data Translation Lookaside Buffer (DTLB) misses due to store operations.
13H 00H DIV: Divide operations executed. This event counts the number of divide operations executed. This includes integer divides, floating point divides, and square-root operations executed. Use IA32_PMC1 only.
14H 00H CYCLES_DIV_BUSY: Cycles the divider is busy. This event counts the number of cycles the divider is busy executing divide or square root operations. The divide can be integer, X87, or Streaming SIMD Extensions (SSE). The square root operation can be either X87 or SSE. Use IA32_PMC0 only.
18H 00H IDLE_DURING_DIV: Cycles the divider is busy and all other execution units are idle. This event counts the number of cycles the divider is busy (with a divide or a square root operation) and no other execution unit or load operation is in progress. Load operations are assumed to hit the L1 data cache. This event considers only micro-ops dispatched after the divider started operating. Use IA32_PMC0 only.
19H 00H DELAYED_BYPASS.FP: Delayed bypass to FP operation. This event counts the number of times floating point operations use data immediately after the data was generated by a non-floating point execution unit. Such cases result in one penalty cycle due to data bypass between the units. Use IA32_PMC1 only.
19H 01H DELAYED_BYPASS.SIMD: Delayed bypass to SIMD operation. This event counts the number of times SIMD operations use data immediately after the data was generated by a non-SIMD execution unit. Such cases result in one penalty cycle due to data bypass between the units. Use IA32_PMC1 only.
19H 02H DELAYED_BYPASS.LOAD: Delayed bypass to load operation. This event counts the number of delayed bypass penalty cycles that a load operation incurred. When load operations use data immediately after the data was generated by an integer execution unit, they may (depending on certain dynamic internal conditions) incur one penalty cycle due to delayed data bypass between the units. Use IA32_PMC1 only.
21H (umask: see Table 18-71) L2_ADS.(Core): Cycles the L2 address bus is in use. This event counts the number of cycles the L2 address bus is being used for accesses to the L2 cache or bus queue. It can count occurrences for this core or both cores.
23H (umask: see Table 18-71) L2_DBUS_BUSY_RD.(Core): Cycles the L2 transfers data to the core. This event counts the number of cycles during which the L2 data bus is busy transferring data from the L2 cache to the core. It counts for all L1 cache misses (data and instruction) that hit the L2 cache. This event can count occurrences for this core or both cores.
24H (umask: combined mask from Table 18-71 and Table 18-73) L2_LINES_IN.(Core, Prefetch): L2 cache misses. This event counts the number of cache lines allocated in the L2 cache. Cache lines are allocated in the L2 cache as a result of requests from the L1 data and instruction caches and the L2 hardware prefetchers to cache lines that are missing in the L2 cache. This event can count occurrences for this core or both cores. It can also count demand requests and L2 hardware prefetch requests together or separately.
25H (umask: see Table 18-71) L2_M_LINES_IN.(Core): L2 cache line modifications. This event counts whenever a modified cache line is written back from the L1 data cache to the L2 cache. This event can count occurrences for this core or both cores.
26H (umask: see Table 18-71 and Table 18-73) L2_LINES_OUT.(Core, Prefetch): L2 cache lines evicted. This event counts the number of L2 cache lines evicted. This event can count occurrences for this core or both cores. It can also count evictions due to demand requests and L2 hardware prefetch requests together or separately.
27H (umask: see Table 18-71 and Table 18-73) L2_M_LINES_OUT.(Core, Prefetch): Modified lines evicted from the L2 cache. This event counts the number of L2 modified cache lines evicted. These lines are written back to memory unless they also exist in a modified state in one of the L1 data caches. This event can count occurrences for this core or both cores. It can also count evictions due to demand requests and L2 hardware prefetch requests together or separately.
28H (umask: combined mask from Table 18-71 and Table 18-74) L2_IFETCH.(Core, Cache Line State): L2 cacheable instruction fetch requests. This event counts the number of instruction cache line requests from the IFU. It does not include fetch requests from uncacheable memory. It does not include ITLB miss accesses. This event can count occurrences for this core or both cores. It can also count accesses to cache lines at different MESI states.
29H (umask: combined mask from Table 18-71, Table 18-73, and Table 18-74) L2_LD.(Core, Prefetch, Cache Line State): L2 cache reads. This event counts L2 cache read requests coming from the L1 data cache and L2 prefetchers. The event can count occurrences:
• For this core or both cores.
• Due to demand requests and L2 hardware prefetch requests together or separately.
• Of accesses to cache lines at different MESI states.
2AH (umask: see Table 18-71 and Table 18-74) L2_ST.(Core, Cache Line State): L2 store requests. This event counts all store operations that miss the L1 data cache and request the data from the L2 cache. The event can count occurrences for this core or both cores. It can also count accesses to cache lines at different MESI states.
2BH (umask: see Table 18-71 and Table 18-74) L2_LOCK.(Core, Cache Line State): L2 locked accesses. This event counts all locked accesses to cache lines that miss the L1 data cache. The event can count occurrences for this core or both cores. It can also count accesses to cache lines at different MESI states.
2EH (umask: see Table 18-71, Table 18-73, and Table 18-74) L2_RQSTS.(Core, Prefetch, Cache Line State): L2 cache requests. This event counts all completed L2 cache requests. This includes L1 data cache reads, writes, and locked accesses, L1 data prefetch requests, instruction fetches, and all L2 hardware prefetch requests. This event can count occurrences:
• For this core or both cores.
• Due to demand requests and L2 hardware prefetch requests together, or separately.
• Of accesses to cache lines at different MESI states.
2EH 41H L2_RQSTS.SELF.DEMAND.I_STATE: L2 cache demand requests from this core that missed the L2. This event counts all completed L2 cache demand requests from this core that miss the L2 cache. This includes L1 data cache reads, writes, and locked accesses, L1 data prefetch requests, and instruction fetches. This is an architectural performance event.
2EH 4FH L2_RQSTS.SELF.DEMAND.MESI: L2 cache demand requests from this core. This event counts all completed L2 cache demand requests from this core. This includes L1 data cache reads, writes, and locked accesses, L1 data prefetch requests, and instruction fetches. This is an architectural performance event.
30H (umask: see Table 18-71, Table 18-73, and Table 18-74) L2_REJECT_BUSQ.(Core, Prefetch, Cache Line State): Rejected L2 cache requests. This event indicates that a pending L2 cache request that requires a bus transaction is delayed from moving to the bus queue. Some of the reasons for this event are:
• The bus queue is full.
• The bus queue already holds an entry for a cache line in the same set.
The number of events is greater or equal to the number of requests that were rejected. The event can count occurrences:
• For this core or both cores.
• Due to demand requests and L2 hardware prefetch requests together, or separately.
• Of accesses to cache lines at different MESI states.
32H (umask: see Table 18-71) L2_NO_REQ.(Core): Cycles no L2 cache requests are pending. This event counts the number of cycles that no L2 cache requests were pending from a core. When using the BOTH_CORE modifier, the event counts only if none of the cores have a pending request. The event counts also when one core is halted and the other is not halted. The event can count occurrences for this core or both cores.
3AH 00H EIST_TRANS: Number of Enhanced Intel SpeedStep Technology (EIST) transitions. This event counts the number of transitions that include a frequency change, either with or without voltage change. This includes Enhanced Intel SpeedStep Technology (EIST) and TM2 transitions. The event is incremented only while the counting core is in C0 state. Since transitions to higher-numbered CxE states and TM2 transitions include a frequency change or voltage transition, the event is incremented accordingly.
3BH C0H THERMAL_TRIP: Number of thermal trips. This event counts the number of thermal trips. A thermal trip occurs whenever the processor temperature exceeds the thermal trip threshold temperature. Following a thermal trip, the processor automatically reduces frequency and voltage. The processor checks the temperature every millisecond and returns to normal when the temperature falls below the thermal trip threshold temperature.
3CH 00H CPU_CLK_UNHALTED.CORE_P: Core cycles when core is not halted. This event counts the number of core cycles while the core is not in a halt state. The core enters the halt state when it is running the HLT instruction. This event is a component in many key event ratios. The core frequency may change due to transitions associated with Enhanced Intel SpeedStep Technology or TM2. For this reason, this event may have a changing ratio with regard to time. When the core frequency is constant, this event can give approximate elapsed time while the core is not in the halt state. This is an architectural performance event.
3CH 01H CPU_CLK_UNHALTED.BUS: Bus cycles when core is not halted. This event counts the number of bus cycles while the core is not in the halt state. This event can give a measurement of the elapsed time while the core was not in the halt state. The core enters the halt state when it is running the HLT instruction. The event also has a constant ratio with the CPU_CLK_UNHALTED.REF event, which is the maximum bus to processor frequency ratio. Non-halted bus cycles are a component in many key event ratios.
3CH 02H CPU_CLK_UNHALTED.NO_OTHER: Bus cycles when core is active and the other is halted. This event counts the number of bus cycles during which the core remains non-halted and the other core on the processor is halted. This event can be used to determine the amount of parallelism exploited by an application or a system. Divide this event count by the bus frequency to determine the amount of time that only one core was in use.
40H (umask: see Table 18-74) L1D_CACHE_LD.(Cache Line State): L1 cacheable data reads. This event counts the number of data reads from cacheable memory. Locked reads are not counted.
41H (umask: see Table 18-74) L1D_CACHE_ST.(Cache Line State): L1 cacheable data writes. This event counts the number of data writes to cacheable memory. Locked writes are not counted.
42H (umask: see Table 18-74) L1D_CACHE_LOCK.(Cache Line State): L1 data cacheable locked reads. This event counts the number of locked data reads from cacheable memory.
42H 10H L1D_CACHE_LOCK_DURATION: Duration of L1 data cacheable locked operation. This event counts the number of cycles during which any cache line is locked by any locking instruction. Locking happens at retirement and therefore the event does not occur for instructions that are speculatively executed. Locking duration is shorter than locked instruction execution duration.
43H 01H L1D_ALL_REF: All references to the L1 data cache. This event counts all references to the L1 data cache, including all loads and stores with any memory types. The event counts memory accesses only when they are actually performed. For example, a load blocked by an unknown store address and later performed is only counted once. The event includes non-cacheable accesses, such as I/O accesses.
43H 02H L1D_ALL_CACHE_REF: L1 data cacheable reads and writes. This event counts the number of data reads and writes from cacheable memory, including locked operations. This event is a sum of:
• L1D_CACHE_LD.MESI
• L1D_CACHE_ST.MESI
• L1D_CACHE_LOCK.MESI
45H 0FH L1D_REPL: Cache lines allocated in the L1 data cache. This event counts the number of lines brought into the L1 data cache.
46H 00H L1D_M_REPL: Modified cache lines allocated in the L1 data cache. This event counts the number of modified lines brought into the L1 data cache.
47H 00H L1D_M_EVICT: Modified cache lines evicted from the L1 data cache. This event counts the number of modified lines evicted from the L1 data cache, whether due to replacement or by snoop HITM intervention.
48H 00H L1D_PEND_MISS: Total number of outstanding L1 data cache misses at any cycle. This event counts the number of outstanding L1 data cache misses at any cycle. An L1 data cache miss is outstanding from the cycle on which the miss is determined until the first chunk of data is available. This event counts:
• All cacheable demand requests.
• L1 data cache hardware prefetch requests.
• Requests to write-through memory.
• Requests to write-combine memory.
Uncacheable requests are not counted. The count of this event divided by the number of L1 data cache misses, L1D_REPL, is the average duration in core cycles of an L1 data cache miss.
49H 01H L1D_SPLIT.LOADS: Cache line split loads from the L1 data cache. This event counts the number of load operations that span two cache lines. Such load operations are also called split loads. Split load operations are executed at retirement.
49H 02H L1D_SPLIT.STORES: Cache line split stores to the L1 data cache. This event counts the number of store operations that span two cache lines.
4BH 00H SSE_PRE_MISS.NTA: Streaming SIMD Extensions (SSE) Prefetch NTA instructions missing all cache levels. This event counts the number of times the SSE instruction prefetchNTA was executed and missed all cache levels. Due to speculation, an executed instruction might not retire. This instruction prefetches the data to the L1 data cache.
4BH 01H SSE_PRE_MISS.L1: Streaming SIMD Extensions (SSE) PrefetchT0 instructions missing all cache levels. This event counts the number of times the SSE instruction prefetchT0 was executed and missed all cache levels. Due to speculation, an executed instruction might not retire. The prefetchT0 instruction prefetches data to the L2 cache and L1 data cache.
4BH 02H SSE_PRE_MISS.L2: Streaming SIMD Extensions (SSE) PrefetchT1 and PrefetchT2 instructions missing all cache levels. This event counts the number of times the SSE instructions prefetchT1 and prefetchT2 were executed and missed all cache levels. Due to speculation, an executed instruction might not retire. The prefetchT1 and prefetchT2 instructions prefetch data to the L2 cache.
4CH 00H LOAD_HIT_PRE: Load operations conflicting with a software prefetch to the same address. This event counts load operations sent to the L1 data cache while a previous Streaming SIMD Extensions (SSE) prefetch instruction to the same cache line has started prefetching but has not yet finished.
4EH 10H L1D_PREFETCH.REQUESTS: L1 data cache prefetch requests. This event counts the number of times the L1 data cache requested to prefetch a data cache line. Requests can be rejected when the L2 cache is busy and resubmitted later or lost. All requests are counted, including those that are rejected.
60H (umask: see Table 18-71 and Table 18-72) BUS_REQUEST_OUTSTANDING.(Core and Bus Agents): Outstanding cacheable data read bus requests duration. This event counts the number of pending full cache line read transactions on the bus occurring in each cycle. A read transaction is pending from the cycle it is sent on the bus until the full cache line is received by the processor. The event counts only full-line cacheable read requests from either the L1 data cache or the L2 prefetchers. It does not count Read for Ownership transactions, instruction byte fetch transactions, or any other bus transaction.
61H (umask: see Table 18-72) BUS_BNR_DRV.(Bus Agents): Number of Bus Not Ready signals asserted. This event counts the number of Bus Not Ready (BNR) signals that the processor asserts on the bus to suspend additional bus requests by other bus agents. A bus agent asserts the BNR signal when the number of data and snoop transactions is close to the maximum that the bus can handle. To obtain the number of bus cycles during which the BNR signal is asserted, multiply the event count by two. While this signal is asserted, new transactions cannot be submitted on the bus. As a result, transaction latency may have higher impact on program performance.
62H (umask: see Table 18-72) BUS_DRDY_CLOCKS.(Bus Agents): Bus cycles when data is sent on the bus. This event counts the number of bus cycles during which the DRDY (Data Ready) signal is asserted on the bus. The DRDY signal is asserted when data is sent on the bus. With the THIS_AGENT mask, this event counts the number of bus cycles during which this agent (the processor) writes data on the bus back to memory or to other bus agents. This includes all explicit and implicit data writebacks, as well as partial writes. With the ALL_AGENTS mask, this event counts the number of bus cycles during which any bus agent sends data on the bus. This includes all data reads and writes on the bus.
63H (umask: see Table 18-71 and Table 18-72) BUS_LOCK_CLOCKS.(Core and Bus Agents): Bus cycles when a LOCK signal is asserted. This event counts the number of bus cycles during which the LOCK signal is asserted on the bus. A LOCK signal is asserted when there is a locked memory access, due to:
• Uncacheable memory.
• Locked operation that spans two cache lines.
• Page-walk from an uncacheable page table.
Bus locks have a very high performance penalty and it is highly recommended to avoid such accesses.
64H (umask: see Table 18-71) BUS_DATA_RCV.(Core): Bus cycles while processor receives data. This event counts the number of bus cycles during which the processor is busy receiving data.
65H (umask: see Table 18-71 and Table 18-72) BUS_TRANS_BRD.(Core and Bus Agents): Burst read bus transactions. This event counts the number of burst read transactions including:
• L1 data cache read misses (and L1 data cache hardware prefetches).
• L2 hardware prefetches by the DPL and L2 streamer.
• IFU read misses of cacheable lines.
It does not include RFO transactions.
66H (umask: see Table 18-71 and Table 18-72) BUS_TRANS_RFO.(Core and Bus Agents): RFO bus transactions. This event counts the number of Read For Ownership (RFO) bus transactions, due to store operations that miss the L1 data cache and the L2 cache. It also counts RFO bus transactions due to locked operations.
67H (umask: see Table 18-71 and Table 18-72) BUS_TRANS_WB.(Core and Bus Agents): Explicit writeback bus transactions. This event counts all explicit writeback bus transactions due to dirty line evictions. It does not count implicit writebacks due to invalidation by a snoop request.
68H (umask: see Table 18-71 and Table 18-72) BUS_TRANS_IFETCH.(Core and Bus Agents): Instruction-fetch bus transactions. This event counts all instruction fetch full cache line bus transactions.
69H (umask: see Table 18-71 and Table 18-72) BUS_TRANS_INVAL.(Core and Bus Agents): Invalidate bus transactions. This event counts all invalidate transactions. Invalidate transactions are generated when:
• A store operation hits a shared line in the L2 cache.
• A full cache line write misses the L2 cache or hits a shared line in the L2 cache.
6AH (umask: see Table 18-71 and Table 18-72) BUS_TRANS_PWR.(Core and Bus Agents): Partial write bus transactions. This event counts partial write bus transactions.
Vol. 3B 19-171
PERFORMANCE MONITORING EVENTS
Table 19-27. Performance Events in Processors Based on Intel® Core™ Microarchitecture (Contd.)
Each entry below lists the event number, the umask value, the event name, a short definition, and the description/comment.
6BH  (umask: see Tables 18-71 and 18-72)  BUS_TRANS_P (Core and Bus Agents)
  Partial bus transactions.
  This event counts all (read and write) partial bus transactions.

6CH  (umask: see Tables 18-71 and 18-72)  BUS_TRANS_IO (Core and Bus Agents)
  IO bus transactions.
  This event counts the number of completed I/O bus transactions resulting from IN and OUT instructions. The count does not include memory-mapped IO.

6DH  (umask: see Tables 18-71 and 18-72)  BUS_TRANS_DEF (Core and Bus Agents)
  Deferred bus transactions.
  This event counts the number of deferred transactions.

6EH  (umask: see Tables 18-71 and 18-72)  BUS_TRANS_BURST (Core and Bus Agents)
  Burst (full cache-line) bus transactions.
  This event counts burst (full cache line) transactions, including:
  • Burst reads.
  • RFOs.
  • Explicit writebacks.
  • Write combine lines.

6FH  (umask: see Tables 18-71 and 18-72)  BUS_TRANS_MEM (Core and Bus Agents)
  Memory bus transactions.
  This event counts all memory bus transactions, including:
  • Burst transactions.
  • Partial reads and writes.
  • Invalidate transactions.
  The BUS_TRANS_MEM count is the sum of BUS_TRANS_BURST, BUS_TRANS_P, and BUS_TRANS_INVAL.
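The identity stated for BUS_TRANS_MEM (the sum of the burst, partial, and invalidate counts) can serve as a sanity check when post-processing raw counter values. A minimal sketch, assuming the counts have already been collected into a dictionary; the function name, keys, and values are illustrative, not part of any tool:

```python
def check_mem_transactions(counts, tolerance=0):
    """Check that BUS_TRANS_MEM equals the sum of its three component
    events, within an optional tolerance (useful when the counters are
    read at slightly different times)."""
    expected = (counts["BUS_TRANS_BURST"]
                + counts["BUS_TRANS_P"]
                + counts["BUS_TRANS_INVAL"])
    return abs(counts["BUS_TRANS_MEM"] - expected) <= tolerance

# Illustrative sample of raw counts (values are made up):
sample = {"BUS_TRANS_BURST": 1200, "BUS_TRANS_P": 300,
          "BUS_TRANS_INVAL": 50, "BUS_TRANS_MEM": 1550}
```

A consistency failure in real data usually points at multiplexed or skewed counter reads rather than a hardware miscount, hence the tolerance parameter.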
70H  (umask: see Tables 18-71 and 18-72)  BUS_TRANS_ANY (Core and Bus Agents)
  All bus transactions.
  This event counts all bus transactions. This includes:
  • Memory transactions.
  • IO transactions (non memory-mapped).
  • Deferred transaction completion.
  • Other less frequent transactions, such as interrupts.

77H  (umask: see Tables 18-71 and 18-75)  EXT_SNOOP (Bus Agents, Snoop Response)
  External snoops.
  This event counts the snoop responses to bus transactions. Responses can be counted separately by type and by bus agent.
  With the THIS_AGENT mask, the event counts snoop responses from this processor to bus transactions sent by this processor. With the ALL_AGENTS mask, the event counts all snoop responses seen on the bus.

78H  (umask: see Tables 18-71 and 18-76)  CMP_SNOOP (Core, Snoop Type)
  L1 data cache snooped by other core.
  This event counts the number of times the L1 data cache is snooped for a cache line that is needed by the other core in the same processor. The cache line is either missing in the L1 instruction or data caches of the other core, or is available for reading only and the other core wishes to write the cache line.
19-172 Vol. 3B
PERFORMANCE MONITORING EVENTS
Table 19-27. Performance Events in Processors Based on Intel® Core™ Microarchitecture (Contd.)
Event Umask Description and
Num Value Event Name Definition Comment
  The snoop operation may change the cache line state. If the other core issued a read request that hit this core in E state, the state typically changes to S state in this core. If the other core issued a read for ownership request (due to a write miss or a hit to S state) that hits this core's cache line in E or S state, this typically results in invalidation of the cache line in this core. If the snoop hits a line in M state, the state is changed at a later opportunity.
  These snoops are performed through the L1 data cache store port. Therefore, frequent snoops may conflict with extensive stores to the L1 data cache, which may increase store latency and impact performance.
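The snoop-induced state changes described above can be sketched as a small transition function over the MESI states. This is an illustrative simplification of the text only (the M-state case is resolved later, so it is left unchanged here, and other coherence-protocol corner cases are omitted):

```python
def snoop_transition(state, request):
    """Resulting MESI state of the snooped core's cache line.
    'read' models the other core's read request; 'rfo' models its
    read-for-ownership request (write miss, or hit to S state)."""
    if request == "read" and state == "E":
        return "S"      # read hitting an E-state line moves it to S
    if request == "rfo" and state in ("E", "S"):
        return "I"      # RFO invalidates the line in this core
    return state        # M-state changes happen at a later opportunity
```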
7AH  (umask: see Table 18-72)  BUS_HIT_DRV (Bus Agents)
  HIT signal asserted.
  This event counts the number of bus cycles during which the processor drives the HIT# pin to signal a HIT snoop response.

7BH  (umask: see Table 18-72)  BUS_HITM_DRV (Bus Agents)
  HITM signal asserted.
  This event counts the number of bus cycles during which the processor drives the HITM# pin to signal a HITM snoop response.

7DH  (umask: see Table 18-71)  BUSQ_EMPTY (Core)
  Bus queue empty.
  This event counts the number of cycles during which the core did not have any pending transactions in the bus queue. It also counts when the core is halted and the other core is not halted.
  This event can count occurrences for this core or both cores.

7EH  (umask: see Tables 18-71 and 18-72)  SNOOP_STALL_DRV (Core and Bus Agents)
  Bus stalled for snoops.
  This event counts the number of times that the bus snoop stall signal is asserted. To obtain the number of bus cycles during which snoops on the bus are prohibited, multiply the event count by two.
  During the snoop stall cycles, no new bus transactions requiring a snoop response can be initiated on the bus. A bus agent asserts a snoop stall signal if it cannot respond to a snoop request within three bus cycles.

7FH  (umask: see Table 18-71)  BUS_IO_WAIT (Core)
  IO requests waiting in the bus queue.
  This event counts the number of core cycles during which IO requests wait in the bus queue. With the SELF modifier, this event counts IO requests per core.
  With the BOTH_CORE modifier, this event increments by one for any cycle during which there is a request from either core.

80H  (umask 00H)  L1I_READS
  Instruction fetches.
  This event counts all instruction fetches, including uncacheable fetches that bypass the Instruction Fetch Unit (IFU).

81H  (umask 00H)  L1I_MISSES
  Instruction Fetch Unit misses.
  This event counts all instruction fetches that miss the Instruction Fetch Unit (IFU) or produce memory requests. This includes uncacheable fetches.
  An instruction fetch miss is counted only once, not once for every cycle it is outstanding.

82H  (umask 02H)  ITLB.SMALL_MISS
  ITLB small page misses.
  This event counts the number of instruction fetches from small pages that miss the ITLB.

82H  (umask 10H)  ITLB.LARGE_MISS
  ITLB large page misses.
  This event counts the number of instruction fetches from large pages that miss the ITLB.
Vol. 3B 19-173
PERFORMANCE MONITORING EVENTS
Table 19-27. Performance Events in Processors Based on Intel® Core™ Microarchitecture (Contd.)
Event Umask Description and
Num Value Event Name Definition Comment
82H  (umask 40H)  ITLB.FLUSH
  ITLB flushes.
  This event counts the number of ITLB flushes. This usually happens upon CR3 or CR0 writes, which are executed by the operating system during process switches.

82H  (umask 12H)  ITLB.MISSES
  ITLB misses.
  This event counts the number of instruction fetches from either small or large pages that miss the ITLB.

83H  (umask 02H)  INST_QUEUE.FULL
  Cycles during which the instruction queue is full.
  This event counts the number of cycles during which the instruction queue is full. In this situation, the core front end stops fetching more instructions. This is an indication of very long stalls in the back-end pipeline stages.

86H  (umask 00H)  CYCLES_L1I_MEM_STALLED
  Cycles during which instruction fetches stalled.
  This event counts the number of cycles for which an instruction fetch stalls, including stalls due to any of the following reasons:
  • Instruction Fetch Unit cache misses.
  • Instruction TLB misses.
  • Instruction TLB faults.

87H  (umask 00H)  ILD_STALL
  Instruction Length Decoder stall cycles due to a length changing prefix.
  This event counts the number of cycles during which the instruction length decoder uses the slow length decoder. Usually, instruction length decoding is done in one cycle. When the slow decoder is used, instruction decoding requires 6 cycles.
  The slow decoder is used in the following cases:
  • An operand-size override prefix (66H) preceding an instruction with immediate data.
  • An address-size override prefix (67H) preceding an instruction with a ModR/M byte in real, big real, 16-bit protected, or 32-bit protected modes.
  To avoid instruction length decoding stalls, generate code using imm8 or imm32 values instead of imm16 values. If you must use an imm16 value, store the value in a register using "mov reg, imm32" and use the register form of the instruction.

88H  (umask 00H)  BR_INST_EXEC
  Branch instructions executed.
  This event counts all executed branches (not necessarily retired). It includes only instructions, not micro-op branches.
  Frequent branching is not necessarily a major performance issue; however, frequent branch mispredictions may be a problem.

89H  (umask 00H)  BR_MISSP_EXEC
  Mispredicted branch instructions executed.
  This event counts the number of mispredicted branch instructions that were executed.

8AH  (umask 00H)  BR_BAC_MISSP_EXEC
  Branch instructions mispredicted at decoding.
  This event counts the number of branch instructions that were mispredicted at decoding.

8BH  (umask 00H)  BR_CND_EXEC
  Conditional branch instructions executed.
  This event counts the number of conditional branch instructions executed, but not necessarily retired.

8CH  (umask 00H)  BR_CND_MISSP_EXEC
  Mispredicted conditional branch instructions executed.
  This event counts the number of mispredicted conditional branch instructions that were executed.

8DH  (umask 00H)  BR_IND_EXEC
  Indirect branch instructions executed.
  This event counts the number of indirect branch instructions that were executed.
8EH  (umask 00H)  BR_IND_MISSP_EXEC
  Mispredicted indirect branch instructions executed.
  This event counts the number of mispredicted indirect branch instructions that were executed.

8FH  (umask 00H)  BR_RET_EXEC
  RET instructions executed.
  This event counts the number of RET instructions that were executed.

90H  (umask 00H)  BR_RET_MISSP_EXEC
  Mispredicted RET instructions executed.
  This event counts the number of mispredicted RET instructions that were executed.

91H  (umask 00H)  BR_RET_BAC_MISSP_EXEC
  RET instructions executed mispredicted at decoding.
  This event counts the number of RET instructions that were executed and were mispredicted at decoding.

92H  (umask 00H)  BR_CALL_EXEC
  CALL instructions executed.
  This event counts the number of CALL instructions executed.

93H  (umask 00H)  BR_CALL_MISSP_EXEC
  Mispredicted CALL instructions executed.
  This event counts the number of mispredicted CALL instructions that were executed.

94H  (umask 00H)  BR_IND_CALL_EXEC
  Indirect CALL instructions executed.
  This event counts the number of indirect CALL instructions that were executed.

97H  (umask 00H)  BR_TKN_BUBBLE_1
  Branch predicted taken with bubble 1.
  The events BR_TKN_BUBBLE_1 and BR_TKN_BUBBLE_2 together count the number of times a taken branch prediction incurred a one-cycle penalty. The penalty occurs when:
  • Too many taken branches are placed together. To avoid this, unroll loops and add a non-taken branch in the middle of the taken sequence.
  • The branch target is unaligned. To avoid this, align the branch target.

98H  (umask 00H)  BR_TKN_BUBBLE_2
  Branch predicted taken with bubble 2.
  The events BR_TKN_BUBBLE_1 and BR_TKN_BUBBLE_2 together count the number of times a taken branch prediction incurred a one-cycle penalty. The penalty occurs when:
  • Too many taken branches are placed together. To avoid this, unroll loops and add a non-taken branch in the middle of the taken sequence.
  • The branch target is unaligned. To avoid this, align the branch target.

A0H  (umask 00H)  RS_UOPS_DISPATCHED
  Micro-ops dispatched for execution.
  This event counts the number of micro-ops dispatched for execution. Up to six micro-ops can be dispatched in each cycle.

A1H  (umask 01H)  RS_UOPS_DISPATCHED.PORT0
  Cycles micro-ops dispatched for execution on port 0.
  This event counts the number of cycles during which micro-ops were dispatched for execution on port 0. Each cycle, at most one micro-op can be dispatched on the port. Issue ports are described in the Intel® 64 and IA-32 Architectures Optimization Reference Manual. Use IA32_PMC0 only.

A1H  (umask 02H)  RS_UOPS_DISPATCHED.PORT1
  Cycles micro-ops dispatched for execution on port 1.
  This event counts the number of cycles during which micro-ops were dispatched for execution on port 1. Each cycle, at most one micro-op can be dispatched on the port. Use IA32_PMC0 only.

A1H  (umask 04H)  RS_UOPS_DISPATCHED.PORT2
  Cycles micro-ops dispatched for execution on port 2.
  This event counts the number of cycles during which micro-ops were dispatched for execution on port 2. Each cycle, at most one micro-op can be dispatched on the port. Use IA32_PMC0 only.
A1H  (umask 08H)  RS_UOPS_DISPATCHED.PORT3
  Cycles micro-ops dispatched for execution on port 3.
  This event counts the number of cycles during which micro-ops were dispatched for execution on port 3. Each cycle, at most one micro-op can be dispatched on the port. Use IA32_PMC0 only.

A1H  (umask 10H)  RS_UOPS_DISPATCHED.PORT4
  Cycles micro-ops dispatched for execution on port 4.
  This event counts the number of cycles during which micro-ops were dispatched for execution on port 4. Each cycle, at most one micro-op can be dispatched on the port. Use IA32_PMC0 only.

A1H  (umask 20H)  RS_UOPS_DISPATCHED.PORT5
  Cycles micro-ops dispatched for execution on port 5.
  This event counts the number of cycles during which micro-ops were dispatched for execution on port 5. Each cycle, at most one micro-op can be dispatched on the port. Use IA32_PMC0 only.

AAH  (umask 01H)  MACRO_INSTS.DECODED
  Instructions decoded.
  This event counts the number of instructions decoded (but not necessarily executed or retired).

AAH  (umask 08H)  MACRO_INSTS.CISC_DECODED
  CISC instructions decoded.
  This event counts the number of complex instructions decoded. Complex instructions usually have more than four micro-ops. Only one complex instruction can be decoded at a time.

ABH  (umask 01H)  ESP.SYNCH
  ESP register content synchronization.
  This event counts the number of times that the ESP register is explicitly used in the address expression of a load or store operation, after it is implicitly used, for example by a PUSH or a POP instruction.
  The ESP synch micro-op uses resources from the rename pipe-stage and up to retirement. The expected ratio of this event divided by the number of implicit ESP changes is 0.2. If the ratio is higher, consider rearranging your code to avoid ESP synchronization events.

ABH  (umask 02H)  ESP.ADDITIONS
  ESP register automatic additions.
  This event counts the number of ESP additions performed automatically by the decoder. A high count for this event is good, since each automatic addition performed by the decoder saves a micro-op from the execution units.
  To maximize the number of ESP additions performed automatically by the decoder, choose instructions that implicitly use the ESP, such as PUSH, POP, CALL, and RET, whenever possible.

B0H  (umask 00H)  SIMD_UOPS_EXEC
  SIMD micro-ops executed (excluding stores).
  This event counts all the SIMD micro-ops executed. It does not count MOVQ and MOVD stores from register to memory.

B1H  (umask 00H)  SIMD_SAT_UOP_EXEC
  SIMD saturated arithmetic micro-ops executed.
  This event counts the number of SIMD saturated arithmetic micro-ops executed.

B3H  (umask 01H)  SIMD_UOP_TYPE_EXEC.MUL
  SIMD packed multiply micro-ops executed.
  This event counts the number of SIMD packed multiply micro-ops executed.

B3H  (umask 02H)  SIMD_UOP_TYPE_EXEC.SHIFT
  SIMD packed shift micro-ops executed.
  This event counts the number of SIMD packed shift micro-ops executed.

B3H  (umask 04H)  SIMD_UOP_TYPE_EXEC.PACK
  SIMD pack micro-ops executed.
  This event counts the number of SIMD pack micro-ops executed.

B3H  (umask 08H)  SIMD_UOP_TYPE_EXEC.UNPACK
  SIMD unpack micro-ops executed.
  This event counts the number of SIMD unpack micro-ops executed.

B3H  (umask 10H)  SIMD_UOP_TYPE_EXEC.LOGICAL
  SIMD packed logical micro-ops executed.
  This event counts the number of SIMD packed logical micro-ops executed.
B3H  (umask 20H)  SIMD_UOP_TYPE_EXEC.ARITHMETIC
  SIMD packed arithmetic micro-ops executed.
  This event counts the number of SIMD packed arithmetic micro-ops executed.

C0H  (umask 00H)  INST_RETIRED.ANY_P
  Instructions retired.
  This event counts the number of instructions that retire execution. For instructions that consist of multiple micro-ops, this event counts the retirement of the last micro-op of the instruction. The counter continues counting during hardware interrupts, traps, and inside interrupt handlers.
  INST_RETIRED.ANY_P is an architectural performance event.
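For reference, the event number and umask columns in this table map directly onto the low 16 bits of the IA32_PERFEVTSELx MSRs: the event select occupies bits 7:0 and the unit mask occupies bits 15:8. A minimal encoding helper as a sketch; the function and variable names are illustrative:

```python
def perfevtsel_low16(event_num, umask):
    """Pack an event number and umask value into the low 16 bits of an
    IA32_PERFEVTSELx programming value (event select in bits 7:0, unit
    mask in bits 15:8). Control bits such as USR, OS, INT, and EN live
    above bit 15 and are not set here."""
    assert 0 <= event_num <= 0xFF and 0 <= umask <= 0xFF
    return (umask << 8) | event_num

# INST_RETIRED.ANY_P is event C0H with umask 00H:
inst_retired_any_p = perfevtsel_low16(0xC0, 0x00)
```

The same 16-bit value is what tools that accept raw event codes typically expect (for example, umask 0FH with event C2H encodes as 0FC2H).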
C0H  (umask 01H)  INST_RETIRED.LOADS
  Instructions retired, which contain a load.
  This event counts the number of instructions retired that contain a load operation.

C0H  (umask 02H)  INST_RETIRED.STORES
  Instructions retired, which contain a store.
  This event counts the number of instructions retired that contain a store operation.

C0H  (umask 04H)  INST_RETIRED.OTHER
  Instructions retired, with no load or store operation.
  This event counts the number of instructions retired that do not contain a load or a store operation.

C1H  (umask 01H)  X87_OPS_RETIRED.FXCH
  FXCH instructions retired.
  This event counts the number of FXCH instructions retired. Modern compilers generate more efficient code and are less likely to use this instruction. If you obtain a high count for this event, consider recompiling the code.

C1H  (umask FEH)  X87_OPS_RETIRED.ANY
  Retired floating-point computational operations (precise event).
  This event counts the number of floating-point computational operations retired. It counts:
  • Floating-point computational operations executed by the assist handler.
  • Sub-operations of complex floating-point instructions, like transcendental instructions.
  This event does not count:
  • Floating-point computational operations that cause traps or assists.
  • Floating-point loads and stores.
  When this event is captured with the precise event mechanism, the collected samples contain the address of the instruction that was executed immediately after the instruction that caused the event.

C2H  (umask 01H)  UOPS_RETIRED.LD_IND_BR
  Fused load+op or load+indirect branch retired.
  This event counts the number of retired micro-ops that fused a load with another operation. This includes:
  • Fusion of a load and an arithmetic operation, such as with the instruction ADD EAX, [EBX], where the content of the memory location specified by the EBX register is loaded, added to the EAX register, and the result is stored in EAX.
  • Fusion of a load and a branch in an indirect branch operation, such as with the instructions JMP [RDI+200] and RET.
  Fusion decreases the number of micro-ops in the processor pipeline. A high value for this event count indicates that the code is using the processor resources effectively.
C2H  (umask 02H)  UOPS_RETIRED.STD_STA
  Fused store address + data retired.
  This event counts the number of store address calculations that are fused with store data emission into one micro-op. Traditionally, each store operation required two micro-ops.
  This event counts fusion of retired micro-ops only. Fusion decreases the number of micro-ops in the processor pipeline. A high value for this event count indicates that the code is using the processor resources effectively.

C2H  (umask 04H)  UOPS_RETIRED.MACRO_FUSION
  Retired instruction pairs fused into one micro-op.
  This event counts the number of times CMP or TEST instructions were fused with a conditional branch instruction into one micro-op. It counts fusion by retired micro-ops only.
  Fusion decreases the number of micro-ops in the processor pipeline. A high value for this event count indicates that the code uses the processor resources more effectively.

C2H  (umask 07H)  UOPS_RETIRED.FUSED
  Fused micro-ops retired.
  This event counts the total number of retired fused micro-ops. The counts include the following fusion types:
  • Fusion of a load operation with an arithmetic operation or with an indirect branch (counted by event UOPS_RETIRED.LD_IND_BR).
  • Fusion of store address and data (counted by event UOPS_RETIRED.STD_STA).
  • Fusion of a CMP or TEST instruction with a conditional branch instruction (counted by event UOPS_RETIRED.MACRO_FUSION).
  Fusion decreases the number of micro-ops in the processor pipeline. A high value for this event count indicates that the code is using the processor resources effectively.

C2H  (umask 08H)  UOPS_RETIRED.NON_FUSED
  Non-fused micro-ops retired.
  This event counts the number of micro-ops retired that were not fused.

C2H  (umask 0FH)  UOPS_RETIRED.ANY
  Micro-ops retired.
  This event counts the number of micro-ops retired. The processor decodes complex macro instructions into a sequence of simpler micro-ops. Most instructions are composed of one or two micro-ops.
  Some instructions are decoded into longer sequences, such as repeat instructions, floating-point transcendental instructions, and assists. In some cases micro-op sequences are fused or whole instructions are fused into one micro-op.
  See the other UOPS_RETIRED events for differentiating retired fused and non-fused micro-ops.
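The umask values suggest how these subcounts relate: 07H is the OR of the three fusion-type umasks (01H, 02H, 04H), and 0FH is the OR of 07H and 08H. A hedged helper for a derived fusion-ratio metric, under the assumption (consistent with that encoding, but not stated explicitly in the table) that retired micro-ops split into fused and non-fused:

```python
def fused_uop_ratio(fused, non_fused):
    """Fraction of retired micro-ops that were fused, assuming
    UOPS_RETIRED.ANY = UOPS_RETIRED.FUSED + UOPS_RETIRED.NON_FUSED
    (consistent with the umask encoding 0FH = 07H | 08H)."""
    total = fused + non_fused
    return fused / total if total else 0.0
```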
C3H  (umask 01H)  MACHINE_NUKES.SMC
  Self-Modifying Code detected.
  This event counts the number of times that a program writes to a code section. Self-modifying code causes a severe penalty in all Intel 64 and IA-32 processors.
C3H  (umask 04H)  MACHINE_NUKES.MEM_ORDER
  Execution pipeline restart due to memory ordering conflict or memory disambiguation misprediction.
  This event counts the number of times the pipeline is restarted due to either multi-threaded memory ordering conflicts or memory disambiguation misprediction.
  A multi-threaded memory ordering conflict occurs when a store, which is executed in another core, hits a load that is executed out of order in this core but not yet retired. As a result, the load needs to be restarted to satisfy the memory ordering model.
  See Chapter 8, “Multiple-Processor Management,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
  To count memory disambiguation mispredictions, use the event MEMORY_DISAMBIGUATION.RESET.

C4H  (umask 00H)  BR_INST_RETIRED.ANY
  Retired branch instructions.
  This event counts the number of branch instructions retired. This is an architectural performance event.

C4H  (umask 01H)  BR_INST_RETIRED.PRED_NOT_TAKEN
  Retired branch instructions that were predicted not-taken.
  This event counts the number of branch instructions retired that were correctly predicted to be not-taken.

C4H  (umask 02H)  BR_INST_RETIRED.MISPRED_NOT_TAKEN
  Retired branch instructions that were mispredicted not-taken.
  This event counts the number of branch instructions retired that were mispredicted and not-taken.

C4H  (umask 04H)  BR_INST_RETIRED.PRED_TAKEN
  Retired branch instructions that were predicted taken.
  This event counts the number of branch instructions retired that were correctly predicted to be taken.

C4H  (umask 08H)  BR_INST_RETIRED.MISPRED_TAKEN
  Retired branch instructions that were mispredicted taken.
  This event counts the number of branch instructions retired that were mispredicted and taken.

C4H  (umask 0CH)  BR_INST_RETIRED.TAKEN
  Retired taken branch instructions.
  This event counts the number of branches retired that were taken.

C5H  (umask 00H)  BR_INST_RETIRED.MISPRED
  Retired mispredicted branch instructions (precise event).
  This event counts the number of retired branch instructions that were mispredicted by the processor. A branch misprediction occurs when the processor predicts that the branch would be taken, but it is not, or vice versa.
  This is an architectural performance event.

C6H  (umask 01H)  CYCLES_INT_MASKED
  Cycles during which interrupts are disabled.
  This event counts the number of cycles during which interrupts are disabled.

C6H  (umask 02H)  CYCLES_INT_PENDING_AND_MASKED
  Cycles during which interrupts are pending and disabled.
  This event counts the number of cycles during which there are pending interrupts but interrupts are disabled.

C7H  (umask 01H)  SIMD_INST_RETIRED.PACKED_SINGLE
  Retired SSE packed-single instructions.
  This event counts the number of SSE packed-single instructions retired.

C7H  (umask 02H)  SIMD_INST_RETIRED.SCALAR_SINGLE
  Retired SSE scalar-single instructions.
  This event counts the number of SSE scalar-single instructions retired.

C7H  (umask 04H)  SIMD_INST_RETIRED.PACKED_DOUBLE
  Retired SSE2 packed-double instructions.
  This event counts the number of SSE2 packed-double instructions retired.

C7H  (umask 08H)  SIMD_INST_RETIRED.SCALAR_DOUBLE
  Retired SSE2 scalar-double instructions.
  This event counts the number of SSE2 scalar-double instructions retired.
C7H  (umask 10H)  SIMD_INST_RETIRED.VECTOR
  Retired SSE2 vector integer instructions.
  This event counts the number of SSE2 vector integer instructions retired.

C7H  (umask 1FH)  SIMD_INST_RETIRED.ANY
  Retired Streaming SIMD instructions (precise event).
  This event counts the overall number of retired SIMD instructions that use XMM registers. To count each type of SIMD instruction separately, use the following events:
  • SIMD_INST_RETIRED.PACKED_SINGLE
  • SIMD_INST_RETIRED.SCALAR_SINGLE
  • SIMD_INST_RETIRED.PACKED_DOUBLE
  • SIMD_INST_RETIRED.SCALAR_DOUBLE
  • SIMD_INST_RETIRED.VECTOR
  When this event is captured with the precise event mechanism, the collected samples contain the address of the instruction that was executed immediately after the instruction that caused the event.

C8H  (umask 00H)  HW_INT_RCV
  Hardware interrupts received.
  This event counts the number of hardware interrupts received by the processor.

C9H  (umask 00H)  ITLB_MISS_RETIRED
  Retired instructions that missed the ITLB.
  This event counts the number of retired instructions that missed the ITLB when they were fetched.

CAH  (umask 01H)  SIMD_COMP_INST_RETIRED.PACKED_SINGLE
  Retired computational SSE packed-single instructions.
  This event counts the number of computational SSE packed-single instructions retired. Computational instructions perform arithmetic computations (for example: add, multiply, and divide).
  Instructions that perform load and store operations or logical operations, like XOR, OR, and AND, are not counted by this event.

CAH  (umask 02H)  SIMD_COMP_INST_RETIRED.SCALAR_SINGLE
  Retired computational SSE scalar-single instructions.
  This event counts the number of computational SSE scalar-single instructions retired. Computational instructions perform arithmetic computations (for example: add, multiply, and divide).
  Instructions that perform load and store operations or logical operations, like XOR, OR, and AND, are not counted by this event.

CAH  (umask 04H)  SIMD_COMP_INST_RETIRED.PACKED_DOUBLE
  Retired computational SSE2 packed-double instructions.
  This event counts the number of computational SSE2 packed-double instructions retired. Computational instructions perform arithmetic computations (for example: add, multiply, and divide).
  Instructions that perform load and store operations or logical operations, like XOR, OR, and AND, are not counted by this event.

CAH  (umask 08H)  SIMD_COMP_INST_RETIRED.SCALAR_DOUBLE
  Retired computational SSE2 scalar-double instructions.
  This event counts the number of computational SSE2 scalar-double instructions retired. Computational instructions perform arithmetic computations (for example: add, multiply, and divide).
  Instructions that perform load and store operations or logical operations, like XOR, OR, and AND, are not counted by this event.
CBH  (umask 01H)  MEM_LOAD_RETIRED.L1D_MISS
  Retired loads that miss the L1 data cache (precise event).
  This event counts the number of retired load operations that missed the L1 data cache. This includes loads from cache lines that are currently being fetched, due to a previous L1 data cache miss to the same cache line.
  This event counts loads from cacheable memory only. The event does not count loads by software prefetches.
  When this event is captured with the precise event mechanism, the collected samples contain the address of the instruction that was executed immediately after the instruction that caused the event.
  Use IA32_PMC0 only.

CBH  (umask 02H)  MEM_LOAD_RETIRED.L1D_LINE_MISS
  L1 data cache line missed by retired loads (precise event).
  This event counts the number of load operations that miss the L1 data cache and send a request to the L2 cache to fetch the missing cache line; that is, fetching of the missing cache line has not yet started.
  The event count is equal to the number of cache lines fetched from the L2 cache by retired loads.
  This event counts loads from cacheable memory only. The event does not count loads by software prefetches.
  The event might not be counted if the load is blocked (see the LOAD_BLOCK events).
  When this event is captured with the precise event mechanism, the collected samples contain the address of the instruction that was executed immediately after the instruction that caused the event.
  Use IA32_PMC0 only.

CBH  (umask 04H)  MEM_LOAD_RETIRED.L2_MISS
  Retired loads that miss the L2 cache (precise event).
  This event counts the number of retired load operations that missed the L2 cache.
  This event counts loads from cacheable memory only. It does not count loads by software prefetches.
  When this event is captured with the precise event mechanism, the collected samples contain the address of the instruction that was executed immediately after the instruction that caused the event.
  Use IA32_PMC0 only.
CBH  (umask 08H)  MEM_LOAD_RETIRED.L2_LINE_MISS
  L2 cache line missed by retired loads (precise event).
  This event counts the number of load operations that miss the L2 cache and result in a bus request to fetch the missing cache line; that is, fetching of the missing cache line has not yet started.
  This event count is equal to the number of cache lines fetched from memory by retired loads.
  This event counts loads from cacheable memory only. The event does not count loads by software prefetches.
  The event might not be counted if the load is blocked (see the LOAD_BLOCK events).
  When this event is captured with the precise event mechanism, the collected samples contain the address of the instruction that was executed immediately after the instruction that caused the event.
  Use IA32_PMC0 only.

CBH  (umask 10H)  MEM_LOAD_RETIRED.DTLB_MISS
  Retired loads that miss the DTLB (precise event).
  This event counts the number of retired loads that missed the DTLB. The DTLB miss is not counted if the load operation causes a fault.
  This event counts loads from cacheable memory only. The event does not count loads by software prefetches.
  When this event is captured with the precise event mechanism, the collected samples contain the address of the instruction that was executed immediately after the instruction that caused the event.
  Use IA32_PMC0 only.

CCH  (umask 01H)  FP_MMX_TRANS_TO_MMX
  Transitions from Floating Point to MMX instructions.
  This event counts the first MMX instructions following a floating-point instruction. Use this event to estimate the penalties for the transitions between floating-point and MMX states.

CCH  (umask 02H)  FP_MMX_TRANS_TO_FP
  Transitions from MMX instructions to Floating Point instructions.
  This event counts the first floating-point instructions following any MMX instruction. Use this event to estimate the penalties for the transitions between floating-point and MMX states.

CDH  (umask 00H)  SIMD_ASSIST
  SIMD assists invoked.
  This event counts the number of SIMD assists invoked. SIMD assists are invoked when an EMMS instruction is executed, changing the MMX state in the floating-point stack.

CEH  (umask 00H)  SIMD_INSTR_RETIRED
  SIMD instructions retired.
  This event counts the number of retired SIMD instructions that use MMX registers.

CFH  (umask 00H)  SIMD_SAT_INSTR_RETIRED
  Saturated arithmetic instructions retired.
  This event counts the number of saturated arithmetic SIMD instructions that retired.

D2H  (umask 01H)  RAT_STALLS.ROB_READ_PORT
  ROB read port stall cycles.
  This event counts the number of cycles when ROB read port stalls occurred, which did not allow new micro-ops to enter the out-of-order pipeline.
  Note that, at this stage in the pipeline, additional stalls may occur in the same cycle and prevent the stalled micro-ops from entering the pipe. In such a case, micro-ops retry entering the execution pipe in the next cycle and the ROB read port stall is counted again.
19-182 Vol. 3B
PERFORMANCE MONITORING EVENTS
Table 19-27. Performance Events in Processors Based on Intel® Core™ Microarchitecture (Contd.)
Event Umask Description and
Num Value Event Name Definition Comment
D2H 02H RAT_STALLS. Partial register stall This event counts the number of cycles instruction
PARTIAL_CYCLES cycles. execution latency became longer than the defined latency
because the instruction uses a register that was partially
written by previous instructions.
D2H 04H RAT_STALLS. Flag stall cycles. This event counts the number of cycles during which
FLAGS execution stalled due to several reasons, one of which is a
partial flag register stall.
A partial flag register stall may occur when two conditions are
met:
• An instruction modifies some, but not all, of the flags in
the flag register.
• The next instruction, which depends on flags, reads flags
that were not modified by this instruction.
D2H 08H RAT_STALLS.FPSW FPU status word stall. This event indicates that the FPU status word (FPSW) is
written. To obtain the number of times the FPSW is written,
divide the event count by 2.
The FPSW is written by instructions with long latency; a
small count may indicate a high penalty.
D2H 0FH RAT_STALLS. All RAT stall cycles. This event counts the number of stall cycles due to
ANY conditions described by:
• RAT_STALLS.ROB_READ_PORT
• RAT_STALLS.PARTIAL_CYCLES
• RAT_STALLS.FLAGS
• RAT_STALLS.FPSW.
D4H 01H SEG_RENAME_ Segment rename stalls This event counts the number of stalls due to the lack of
STALLS.ES - ES. renaming resources for the ES segment register. If a
segment is renamed, but not retired and a second update to
the same segment occurs, a stall occurs in the front end of
the pipeline until the renamed segment retires.
D4H 02H SEG_RENAME_ Segment rename stalls This event counts the number of stalls due to the lack of
STALLS.DS - DS. renaming resources for the DS segment register. If a
segment is renamed, but not retired and a second update to
the same segment occurs, a stall occurs in the front end of
the pipeline until the renamed segment retires.
D4H 04H SEG_RENAME_ Segment rename stalls This event counts the number of stalls due to the lack of
STALLS.FS - FS. renaming resources for the FS segment register.
If a segment is renamed, but not retired and a second
update to the same segment occurs, a stall occurs in the
front end of the pipeline until the renamed segment retires.
D4H 08H SEG_RENAME_ Segment rename stalls This event counts the number of stalls due to the lack of
STALLS.GS - GS. renaming resources for the GS segment register.
If a segment is renamed, but not retired and a second
update to the same segment occurs, a stall occurs in the
front end of the pipeline until the renamed segment retires.
D4H 0FH SEG_RENAME_ Any (ES/DS/FS/GS) This event counts the number of stalls due to the lack of
STALLS.ANY segment rename stall. renaming resources for the ES, DS, FS, and GS segment
registers.
If a segment is renamed but not retired and a second update
to the same segment occurs, a stall occurs in the front end
of the pipeline until the renamed segment retires.
D5H 01H SEG_REG_ Segment renames - This event counts the number of times the ES segment
RENAMES.ES ES. register is renamed.
D5H 02H SEG_REG_ Segment renames - This event counts the number of times the DS segment
RENAMES.DS DS. register is renamed.
D5H 04H SEG_REG_ Segment renames - This event counts the number of times the FS segment
RENAMES.FS FS. register is renamed.
D5H 08H SEG_REG_ Segment renames - This event counts the number of times the GS segment
RENAMES.GS GS. register is renamed.
D5H 0FH SEG_REG_ Any (ES/DS/FS/GS) This event counts the number of times any of the four
RENAMES.ANY segment rename. segment registers (ES/DS/FS/GS) is renamed.
DCH 01H RESOURCE_STALLS.ROB_FULL Cycles during which the ROB is full. This event counts the number of
cycles when the number of instructions in the pipeline waiting for retirement reaches
the limit the processor can handle.
A high count for this event indicates that there are long
latency operations in the pipe (possibly load and store
operations that miss the L2 cache, and other instructions
that depend on these cannot execute until the former
instructions complete execution). In this situation new
instructions cannot enter the pipe and start execution.
DCH 02H RESOURCE_STALLS.RS_FULL Cycles during which the RS is full. This event counts the number of
cycles when the number of instructions in the pipeline waiting for execution reaches
the limit the processor can handle.
A high count of this event indicates that there are long
latency operations in the pipe (possibly load and store
operations that miss the L2 cache, and other instructions
that depend on these cannot execute until the former
instructions complete execution). In this situation new
instructions cannot enter the pipe and start execution.
DCH 04H RESOURCE_STALLS.LD_ST Cycles during which the pipeline has exceeded the load or store
limit, or is waiting to commit all stores. This event counts the number of cycles while resource-
related stalls occur due to:
• The number of load instructions in the pipeline reached
the limit the processor can handle. The stall ends when a
loading instruction retires.
• The number of store instructions in the pipeline reached
the limit the processor can handle. The stall ends when a
storing instruction commits its data to the cache or
memory.
• There is an instruction in the pipe that can be executed
only when all previous stores complete and their data is
committed in the caches or memory. For example, the
SFENCE and MFENCE instructions require this behavior.
DCH 08H RESOURCE_ Cycles stalled due to This event counts the number of cycles while execution was
STALLS.FPCW FPU control word stalled due to writing the floating-point unit (FPU) control
write. word.
DCH 10H RESOURCE_ Cycles stalled due to This event counts the number of cycles after a branch
STALLS.BR_MISS_CLEAR branch misprediction. misprediction is detected at execution until the branch and
all older micro-ops retire. During this time new micro-ops
cannot enter the out-of-order pipeline.
DCH 1FH RESOURCE_STALLS.ANY Resource related stalls. This event counts the number of cycles while resource-
related stalls occur for any of the conditions described by the
following events:
• RESOURCE_STALLS.ROB_FULL
• RESOURCE_STALLS.RS_FULL
• RESOURCE_STALLS.LD_ST
• RESOURCE_STALLS.FPCW
• RESOURCE_STALLS.BR_MISS_CLEAR
E0H 00H BR_INST_ Branch instructions This event counts the number of branch instructions
DECODED decoded. decoded.
E4H 00H BOGUS_BR Bogus branches. This event counts the number of byte sequences that were
mistakenly detected as taken branch instructions.
This results in a BACLEAR event. This occurs mainly after
task switches.
E6H 00H BACLEARS BACLEARS asserted. This event counts the number of times the front end is
resteered, mainly when the BPU cannot provide a correct
prediction and this is corrected by other branch handling
mechanisms at the front end. This can occur if the code has
many branches such that they cannot be consumed by the
BPU.
Each BACLEAR asserted costs approximately 7 cycles of
instruction fetch. The effect on total execution time
depends on the surrounding code.
F0H 00H PREF_RQSTS_UP Upward prefetches This event counts the number of upward prefetches issued
issued from DPL. from the Data Prefetch Logic (DPL) to the L2 cache. A
prefetch request issued to the L2 cache cannot be cancelled
and the requested cache line is fetched to the L2 cache.
F8H 00H PREF_RQSTS_DN Downward prefetches This event counts the number of downward prefetches
issued from DPL. issued from the Data Prefetch Logic (DPL) to the L2 cache. A
prefetch request issued to the L2 cache cannot be cancelled
and the requested cache line is fetched to the L2 cache.
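Several of the events above are meant to be combined into derived ratios, e.g. the RESOURCE_STALLS.* sub-events against total non-halted cycles, or BACLEARS against the roughly 7-cycle fetch penalty noted for event E6H. The post-processing can be sketched as follows; all counter values and variable names here are hypothetical, standing in for values read from the PMU by whatever collection tool is in use:

```python
# Hypothetical raw counter readings collected over one measurement interval.
counts = {
    "CPU_CLK_UNHALTED.CORE": 1_000_000,
    "RESOURCE_STALLS.ANY": 250_000,
    "RESOURCE_STALLS.ROB_FULL": 90_000,
    "BACLEARS": 4_000,
}

BACLEAR_FETCH_PENALTY = 7  # approximate cycles per BACLEAR, per event E6H

def stall_fraction(event: str) -> float:
    """Fraction of non-halted core cycles spent in the given stall condition."""
    return counts[event] / counts["CPU_CLK_UNHALTED.CORE"]

# Rough upper bound on instruction-fetch cycles lost to front-end resteers.
lost_fetch_cycles = counts["BACLEARS"] * BACLEAR_FETCH_PENALTY

print(f"any resource stall: {stall_fraction('RESOURCE_STALLS.ANY'):.1%}")
print(f"ROB-full stall:     {stall_fraction('RESOURCE_STALLS.ROB_FULL'):.1%}")
print(f"BACLEAR fetch loss: {lost_fetch_cycles} cycles")
```

As the BACLEARS description notes, the true execution-time impact of resteers depends on the surrounding code, so the fetch-loss figure is only an estimate.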
Table 19-28. Performance Events for the Goldmont Plus Microarchitecture (Contd.)
Event Umask
Num. Value Event Name Description Comment
4FH 10H EPT.WALK_PENDING Counts once per cycle for each page walk only while traversing the
Extended Page Table (EPT), and does not count during the rest of the
translation. The EPT is used for translating Guest-Physical Addresses to
Physical Addresses for Virtual Machine Monitors (VMMs). Average
cycles per walk can be calculated by dividing the count by number of
walks.
85H 02H ITLB_MISSES.WALK_CO Counts page walks completed due to instruction fetches whose address
MPLETED_4K translations missed in the TLB and were mapped to 4K pages. The page
walks can end with or without a page fault.
85H 04H ITLB_MISSES.WALK_CO Counts page walks completed due to instruction fetches whose address
MPLETED_2M_4M translations missed in the TLB and were mapped to 2M or 4M pages.
The page walks can end with or without a page fault.
85H 08H ITLB_MISSES.WALK_CO Counts page walks completed due to instruction fetches whose address
MPLETED_1GB translations missed in the TLB and were mapped to 1GB pages. The
page walks can end with or without a page fault.
85H 10H ITLB_MISSES.WALK_PE Counts once per cycle for each page walk occurring due to an
NDING instruction fetch. Includes cycles spent traversing the Extended Page
Table (EPT). Average cycles per walk can be calculated by dividing by
the number of walks.
BDH 20H TLB_FLUSHES.STLB_AN Counts STLB flushes. The TLBs are flushed on instructions like INVLPG
Y and MOV to CR3.
C3H 20H MACHINE_CLEARS.PAGE_FAULT Counts the number of times that the machine clears due to a page
fault. Covers both I-side and D-side (loads/stores) page faults. A page
fault occurs when either the page is not present or an access violation occurs.
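As the WALK_PENDING descriptions above note, average cycles per walk is the pending-cycle count divided by the number of walks. For instruction-side walks, the three ITLB_MISSES.WALK_COMPLETED_* sub-events can serve as the walk count. A minimal sketch, with hypothetical counter values:

```python
# Hypothetical counter values for the ITLB_MISSES events (event 85H).
walk_pending = 36_000          # ITLB_MISSES.WALK_PENDING (umask 10H)
walks_completed = {
    "4K": 800,                 # ITLB_MISSES.WALK_COMPLETED_4K    (umask 02H)
    "2M_4M": 150,              # ITLB_MISSES.WALK_COMPLETED_2M_4M (umask 04H)
    "1GB": 50,                 # ITLB_MISSES.WALK_COMPLETED_1GB   (umask 08H)
}

total_walks = sum(walks_completed.values())
avg_cycles_per_walk = walk_pending / total_walks
print(f"average ITLB walk latency: {avg_cycles_per_walk:.1f} cycles")
```

Note that WALK_PENDING includes cycles spent traversing the EPT, so under virtualization the average will reflect both guest and EPT walk portions.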
05H 03H PAGE_WALKS.CYCLES Counts every core cycle a page-walk is in progress due to either a data
memory operation, or an instruction fetch.
0EH 00H UOPS_ISSUED.ANY Counts uops issued by the front end and allocated into the back end of
the machine. This event counts uops that retire as well as uops that
were speculatively executed but did not retire. The sort of speculative
uops that might be counted includes, but is not limited to, those uops
issued in the shadow of a mispredicted branch, those uops that are
inserted during an assist (such as for a denormal floating-point result),
and (previously allocated) uops that might be canceled during a
machine clear.
13H 02H MISALIGN_MEM_REF.LO Counts when a memory load of a uop that spans a page boundary (a Precise Event
AD_PAGE_SPLIT split) is retired.
13H 04H MISALIGN_MEM_REF.ST Counts when a memory store of a uop that spans a page boundary (a Precise Event
ORE_PAGE_SPLIT split) is retired.
2EH 4FH LONGEST_LAT_CACHE. Counts memory requests originating from the core that reference a
REFERENCE cache line in the L2 cache.
2EH 41H LONGEST_LAT_CACHE. Counts memory requests originating from the core that miss in the L2
MISS cache.
30H 00H L2_REJECT_XQ.ALL Counts the number of demand and prefetch transactions that the L2
XQ rejects due to a full or near full condition which likely indicates back
pressure from the intra-die interconnect (IDI) fabric. The XQ may reject
transactions from the L2Q (non-cacheable requests), L2 misses and L2
write-back victims.
31H 00H CORE_REJECT_L2Q.ALL Counts the number of demand and L1 prefetcher requests rejected by
the L2Q due to a full or nearly full condition which likely indicates back
pressure from L2Q. It also counts requests that would have gone
directly to the XQ, but are rejected due to a full or nearly full condition,
indicating back pressure from the IDI link. The L2Q may also reject
transactions from a core to ensure fairness between cores, or to delay
a core's dirty eviction when the address conflicts with incoming
external snoops.
3CH 00H CPU_CLK_UNHALTED.C Core cycles when core is not halted. This event uses a programmable
ORE_P general purpose performance counter.
3CH 01H CPU_CLK_UNHALTED.R Reference cycles when core is not halted. This event uses a
EF programmable general purpose performance counter.
51H 01H DL1.DIRTY_EVICTION Counts when a modified (dirty) cache line is evicted from the data L1
cache and needs to be written back to memory. No count will occur if
the evicted line is clean, and hence does not require a writeback.
C0H 00H INST_RETIRED.ANY_P Counts the number of instructions that retire execution. For Precise Event
instructions that consist of multiple uops, this event counts the
retirement of the last uop of the instruction. The event continues
counting during hardware interrupts, traps, and inside interrupt
handlers. This is an architectural performance event. This event uses a
programmable general purpose performance counter. This event is a
Precise Event: the EventingRIP field in the PEBS record is precise to the
address of the instruction which caused the event.
Note: Because PEBS records can be collected only on IA32_PMC0, only
one event can use the PEBS facility at a time.
C2H 00H UOPS_RETIRED.ANY Counts uops which have retired. Precise Event,
Not Reduced
Skid
C4H 7EH BR_INST_RETIRED.JCC Counts Jcc (Jump on Conditional Code/Jump if Condition is Met) Precise Event
branch instructions retired, including both when the branch was taken
and when it was not taken.
C4H 80H BR_INST_RETIRED.ALL_ Counts the number of taken branch instructions retired. Precise Event
TAKEN_BRANCHES
C4H FEH BR_INST_RETIRED.TAKEN_JCC Counts Jcc (Jump on Conditional Code/Jump if Condition is Met) branch Precise Event
instructions retired that were taken; it does not count Jcc branch
instructions that were not taken.
C4H F9H BR_INST_RETIRED.CALL Counts near CALL branch instructions retired. Precise Event
C4H FDH BR_INST_RETIRED.REL_ Counts near relative CALL branch instructions retired. Precise Event
CALL
C4H FBH BR_INST_RETIRED.IND_ Counts near indirect CALL branch instructions retired. Precise Event
CALL
C4H F7H BR_INST_RETIRED.RET Counts near return branch instructions retired. Precise Event
URN
C4H EBH BR_INST_RETIRED.NON Counts near indirect call or near indirect jmp branch instructions retired. Precise Event
_RETURN_IND
C4H BFH BR_INST_RETIRED.FAR Counts far branch instructions retired. This includes far jump, far call Precise Event
_BRANCH and return, and Interrupt call and return.
C5H 00H BR_MISP_RETIRED.ALL Counts mispredicted branch instructions retired including all branch Precise Event
_BRANCHES types.
C5H FBH BR_MISP_RETIRED.IND_ Counts mispredicted near indirect CALL branch instructions retired, Precise Event
CALL where the target address taken was not what the processor predicted.
C5H F7H BR_MISP_RETIRED.RET Counts mispredicted near RET branch instructions retired, where the Precise Event
URN return address taken was not what the processor predicted.
C5H EBH BR_MISP_RETIRED.NON Counts mispredicted branch instructions retired that were near indirect Precise Event
_RETURN_IND call or near indirect jmp, where the target address taken was not what
the processor predicted.
CAH 01H ISSUE_SLOTS_NOT_CONSUMED.RESOURCE_FULL Counts the number of issue slots per core cycle that were not
consumed because of a full resource in the back end. These resources
include, but are not limited to, the Re-order Buffer (ROB), reservation
stations (RS), load/store buffers, physical registers, or any other
needed machine resource that is currently unavailable. Note that uops
must be available for consumption in order for this event to fire. If a
uop is not available (Instruction Queue is empty), this event will not
count.
CAH 02H ISSUE_SLOTS_NOT_CO Counts the number of issue slots per core cycle that were not
NSUMED.RECOVERY consumed by the back end because allocation is stalled waiting for a
mispredicted jump to retire or other branch-like conditions (e.g. the
event is relevant during certain microcode flows). Counts all issue slots
blocked while within this window, including slots where uops were not
available in the Instruction Queue.
CAH 00H ISSUE_SLOTS_NOT_CO Counts the number of issue slots per core cycle that were not
NSUMED.ANY consumed by the back end due to either a full resource in the back end
(RESOURCE_FULL), or due to the processor recovering from some
event (RECOVERY).
CBH 01H HW_INTERRUPTS.RECEI Counts hardware interrupts received by the processor.
VED
CBH 02H HW_INTERRUPTS.MASK Counts the number of core cycles during which interrupts are masked
ED (disabled). Increments by 1 each core cycle that EFLAGS.IF is 0,
regardless of whether interrupts are pending or not.
CBH 04H HW_INTERRUPTS.PENDI Counts core cycles during which there are pending interrupts, but
NG_AND_MASKED interrupts are masked (EFLAGS.IF = 0).
CDH 00H CYCLES_DIV_BUSY.ALL Counts core cycles if either divide unit is busy.
CDH 01H CYCLES_DIV_BUSY.IDIV Counts core cycles if the integer divide unit is busy.
CDH 02H CYCLES_DIV_BUSY.FPDI Counts core cycles if the floating point divide unit is busy.
V
D0H 81H MEM_UOPS_RETIRED.A Counts the number of load uops retired. Precise Event
LL_LOADS
D0H 82H MEM_UOPS_RETIRED.A Counts the number of store uops retired. Precise Event
LL_STORES
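The ISSUE_SLOTS_NOT_CONSUMED.* events above lend themselves to a top-down style slot accounting: lost slots divided by total available slots gives the fraction of issue bandwidth the back end could not absorb. A sketch with hypothetical counter values; the issue width is a machine parameter the user must supply, not something these events report:

```python
ISSUE_WIDTH = 3                      # issue slots per core cycle; machine-specific assumption

cycles = 1_000_000                   # CPU_CLK_UNHALTED.CORE_P, hypothetical reading
slots_not_consumed_any = 900_000     # ISSUE_SLOTS_NOT_CONSUMED.ANY, hypothetical reading

total_slots = ISSUE_WIDTH * cycles
back_end_bound = slots_not_consumed_any / total_slots
print(f"issue slots lost to back-end stalls/recovery: {back_end_bound:.1%}")
```

The RESOURCE_FULL and RECOVERY sub-events can be substituted for the ANY count to attribute the lost slots to full back-end resources versus misprediction recovery.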
D0H 21H MEM_UOPS_RETIRED.L Counts locked memory uops retired. This includes 'regular' locks and Precise Event
OCK_LOADS bus locks. To specifically count bus locks only, see the offcore response
event. A locked access is one with a lock prefix, or an exchange to
memory.
D0H 41H MEM_UOPS_RETIRED.S Counts load uops retired where the data requested spans a 64 byte Precise Event
PLIT_LOADS cache line boundary.
D0H 42H MEM_UOPS_RETIRED.S Counts store uops retired where the data requested spans a 64 byte Precise Event
PLIT_STORES cache line boundary.
D0H 43H MEM_UOPS_RETIRED.S Counts memory uops retired where the data requested spans a 64 Precise Event
PLIT byte cache line boundary.
D1H 01H MEM_LOAD_UOPS_RETI Counts load uops retired that hit the L1 data cache. Precise Event
RED.L1_HIT
D1H 08H MEM_LOAD_UOPS_RETI Counts load uops retired that miss the L1 data cache. Precise Event
RED.L1_MISS
D1H 02H MEM_LOAD_UOPS_RETI Counts load uops retired that hit in the L2 cache. Precise Event
RED.L2_HIT
D1H 10H MEM_LOAD_UOPS_RETI Counts load uops retired that miss in the L2 cache. Precise Event
RED.L2_MISS
D1H 20H MEM_LOAD_UOPS_RETIRED.HITM Counts load uops retired where the cache line containing the data was Precise Event
in the modified state of another core's or module's cache (HITM). More
specifically, this means that when the load address was checked by
other caching agents (typically another processor) in the system, one
of those caching agents indicated that they had a dirty copy of the
data. Loads that obtain a HITM response incur greater latency than
is typical for a load. In addition, since HITM indicates that
some other processor had this data in its cache, it implies that the data
was shared between processors, or potentially was a lock or
semaphore value. This event is useful for locating sharing, false
sharing, and contended locks.
07H 06H PREFETCH.SW_L2 Streaming SIMD This event counts the number of times the SSE instructions
Extensions (SSE) prefetchT1 and prefetchT2 are executed. These instructions
PrefetchT1 and prefetch the data to the L2 cache.
PrefetchT2
instructions executed.
Table 19-31. Performance Events for 45 nm, 32 nm Intel® Atom™ Processors (Contd.)
Event Umask
Num. Value Event Name Definition Description and Comment
07H 08H PREFETCH.PREFETCHN Streaming SIMD This event counts the number of times the SSE instruction
TA Extensions (SSE) prefetchNTA is executed. This instruction prefetches the data
Prefetch NTA to the L1 data cache.
instructions executed.
08H 07H DATA_TLB_MISSES.DTLB_MISS Memory accesses that missed the DTLB. This event counts the number of Data Translation Lookaside Buffer
(DTLB) misses. The count includes misses detected as a result
of speculative accesses. Typically a high count for this event
indicates that the code accesses a large number of data pages.
08H 05H DATA_TLB_MISSES.DTLB_MISS_LD DTLB misses due to load operations. This event counts the number of Data Translation Lookaside Buffer
(DTLB) misses due to load operations. This count includes
misses detected as a result of speculative accesses.
08H 09H DATA_TLB_MISSES.L0_DTLB_MISS_LD L0_DTLB misses due to load operations. This event counts the number of L0_DTLB misses due to load
operations. This count includes misses detected as a result of
speculative accesses.
08H 06H DATA_TLB_MISSES.DTLB_MISS_ST DTLB misses due to store operations. This event counts the number of Data Translation Lookaside Buffer
(DTLB) misses due to store operations. This count includes
misses detected as a result of speculative accesses.
0CH 03H PAGE_WALKS.WALKS Number of page-walks executed. This event counts the number of page-walks executed due to
either a DTLB or ITLB miss. The page walk duration,
PAGE_WALKS.CYCLES, divided by the number of page walks is the
average duration of a page walk. This can hint at whether most
of the page-walks are satisfied by the caches or cause an L2
cache miss.
Edge trigger bit must be set.
0CH 03H PAGE_WALKS.CYCLES Duration of page-walks This event counts the duration of page-walks in core cycles. The
in core cycles. paging mode in use typically affects the duration of page walks.
Page walk duration divided by number of page walks is the
average duration of page-walks. This can hint at whether most
of the page-walks are satisfied by the caches or cause an L2
cache miss.
Edge trigger bit must be cleared.
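Since event 0CH with umask 03H counts completed walks when the edge trigger bit is set and walk cycles when it is cleared, the average walk duration comes from two separate measurements of the same event code. A sketch with hypothetical counter values:

```python
# Event 0CH / umask 03H measured twice: once edge-triggered (walk count),
# once level-triggered (walk duration in core cycles).
page_walks = 2_000          # PAGE_WALKS.WALKS  (edge trigger bit set)
walk_cycles = 70_000        # PAGE_WALKS.CYCLES (edge trigger bit cleared)

avg_walk_duration = walk_cycles / page_walks
print(f"average page-walk duration: {avg_walk_duration:.1f} core cycles")
# A long average suggests walks frequently miss the caches (possible L2 misses).
```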
10H 01H X87_COMP_OPS_EXE. Floating point This event counts the number of x87 floating point
ANY.S computational micro- computational micro-ops executed.
ops executed.
10H 81H X87_COMP_OPS_EXE. Floating point This event counts the number of x87 floating point
ANY.AR computational micro- computational micro-ops retired.
ops retired.
11H 01H FP_ASSIST Floating point assists. This event counts the number of floating point operations
executed that required micro-code assist intervention. These
assists are required in the following cases.
X87 instructions:
1. NaN or denormal are loaded to a register or used as input
from memory.
2. Division by 0.
3. Underflow output.
11H 81H FP_ASSIST.AR Floating point assists. This event counts the number of floating point operations
executed that required micro-code assist intervention. These
assists are required in the following cases.
X87 instructions:
1. NaN or denormal are loaded to a register or used as input
from memory.
2. Division by 0.
3. Underflow output.
12H 01H MUL.S Multiply operations This event counts the number of multiply operations executed.
executed. This includes integer as well as floating point multiply
operations.
12H 81H MUL.AR Multiply operations This event counts the number of multiply operations retired.
retired. This includes integer as well as floating point multiply
operations.
13H 01H DIV.S Divide operations This event counts the number of divide operations executed.
executed. This includes integer divides, floating point divides and square-
root operations executed.
13H 81H DIV.AR Divide operations This event counts the number of divide operations retired. This
retired. includes integer divides, floating point divides and square-root
operations executed.
14H 01H CYCLES_DIV_BUSY Cycles the divider is busy. This event counts the number of cycles the divider is busy
executing divide or square root operations. The divide can be
integer, X87 or Streaming SIMD Extensions (SSE). The square
root operation can be either X87 or SSE.
21H See Table 18-71 L2_ADS Cycles L2 address bus is in use. This event counts the number of cycles the L2 address bus is
being used for accesses to the L2 cache or bus queue.
This event can count occurrences for this core or both cores.
22H See Table 18-71 L2_DBUS_BUSY Cycles the L2 cache data bus is busy. This event counts core cycles during which the L2 cache data
bus is busy transferring data from the L2 cache to the core. It
counts for all L1 cache misses (data and instruction) that hit the
L2 cache. The count will increment by two for a full cache-line
request.
24H See Tables 18-71 and 18-73 L2_LINES_IN L2 cache misses. This event counts the number of cache lines
allocated in the L2 cache. Cache lines are allocated in the L2 cache as a result of
requests from the L1 data and instruction caches and the hardware prefetchers to
cache lines that are missing in the L2 cache.
This event can count occurrences for this core or both cores.
This event can also count demand requests and L2 hardware
prefetch requests together or separately.
25H See Table 18-71 L2_M_LINES_IN L2 cache line modifications. This event counts whenever a modified cache line is written
back from the L1 data cache to the L2 cache.
This event can count occurrences for this core or both cores.
26H See Tables 18-71 and 18-73 L2_LINES_OUT L2 cache lines evicted. This event counts the number of L2 cache lines evicted.
This event can count occurrences for this core or both cores.
This event can also count evictions due to demand requests and
L2 hardware prefetch requests together or separately.
27H See Tables 18-71 and 18-73 L2_M_LINES_OUT Modified lines evicted from the L2 cache. This event counts the number of L2 modified cache lines
evicted. These lines are written back to memory unless they
also exist in a shared-state in one of the L1 data caches.
This event can count occurrences for this core or both cores.
This event can also count evictions due to demand requests and
L2 hardware prefetch requests together or separately.
28H See Tables 18-71 and 18-74 L2_IFETCH L2 cacheable instruction fetch requests. This event counts the number of instruction cache line requests
from the ICache. It does not include fetch requests from
uncacheable memory. It does not include ITLB miss accesses.
This event can count occurrences for this core or both cores.
This event can also count accesses to cache lines at different
MESI states.
29H See Tables 18-71, 18-73, and 18-74 L2_LD L2 cache reads. This event counts L2 cache read requests coming from the L1
data cache and L2 prefetchers.
This event can count occurrences:
- for this core or both cores.
- due to demand requests and L2 hardware prefetch requests
together or separately.
- of accesses to cache lines at different MESI states.
2AH See Tables 18-71 and 18-74 L2_ST L2 store requests. This event counts all store operations that miss the L1 data
cache and request the data from the L2 cache.
This event can count occurrences for this core or both cores.
This event can also count accesses to cache lines at different
MESI states.
2BH See Tables 18-71 and 18-74 L2_LOCK L2 locked accesses. This event counts all locked accesses to cache lines that miss
the L1 data cache.
This event can count occurrences for this core or both cores.
This event can also count accesses to cache lines at different
MESI states.
2EH See Tables 18-71, 18-73, and 18-74 L2_RQSTS L2 cache requests. This event counts all completed L2 cache
requests. This includes L1 data cache reads, writes, and locked accesses, L1
data prefetch requests, instruction fetches, and all L2 hardware
prefetch requests.
This event can count occurrences:
- for this core or both cores.
- due to demand requests and L2 hardware prefetch requests
together, or separately.
- of accesses to cache lines at different MESI states.
2EH 41H L2_RQSTS.SELF.DEMA L2 cache demand This event counts all completed L2 cache demand requests
ND.I_STATE requests from this core from this core that miss the L2 cache. This includes L1 data
that missed the L2. cache reads, writes, and locked accesses, L1 data prefetch
requests, and instruction fetches.
This is an architectural performance event.
2EH 4FH L2_RQSTS.SELF.DEMA L2 cache demand This event counts all completed L2 cache demand requests
ND.MESI requests from this from this core. This includes L1 data cache reads, writes, and
core. locked accesses, L1 data prefetch requests, and instruction
fetches.
This is an architectural performance event.
30H See Tables 18-71, 18-73, and 18-74 L2_REJECT_BUSQ Rejected L2 cache requests. This event
indicates that a pending L2 cache request that requires a bus transaction is delayed
from moving to the bus queue. Some of the reasons for this event are:
- The bus queue is full.
- The bus queue already holds an entry for a cache line in the same set.
The number of events is greater than or equal to the number of
requests that were rejected.
This event can count occurrences:
- for this core or both cores.
- due to demand requests and L2 hardware prefetch requests
together, or separately.
- of accesses to cache lines at different MESI states.
32H See Table 18-71 L2_NO_REQ Cycles no L2 cache requests are pending. This event counts the number of cycles that no L2 cache
requests are pending.
3AH 00H EIST_TRANS Number of Enhanced This event counts the number of Enhanced Intel SpeedStep(R)
Intel SpeedStep(R) Technology (EIST) transitions that include a frequency change,
Technology (EIST) either with or without VID change. This event is incremented
transitions. only while the counting core is in C0 state. In situations where
an EIST transition was caused by hardware as a result of CxE
state transitions, those EIST transitions will also be registered
in this event.
Enhanced Intel SpeedStep Technology transitions are commonly
initiated by the OS, but can be initiated by hardware internally. For
example, CxE states are C-states (C1, C2, C3, ...) which not only
place the CPU into a sleep state by turning off the clock and
other components, but also lower the voltage (which reduces
the leakage power consumption). The same is true for the thermal
throttling transition, which uses Enhanced Intel SpeedStep
Technology internally.
3BH    C0H    THERMAL_TRIP    Number of thermal trips.
    This event counts the number of thermal trips. A thermal trip occurs whenever the
    processor temperature exceeds the thermal trip threshold temperature. Following a
    thermal trip, the processor automatically reduces frequency and voltage. The
    processor checks the temperature every millisecond, and returns to normal operation
    when the temperature falls below the thermal trip threshold temperature.
PERFORMANCE MONITORING EVENTS
Table 19-31. Performance Events for 45 nm, 32 nm Intel® Atom™ Processors (Contd.)
Event Num.    Umask Value    Event Name    Definition    Description and Comment
3CH    00H    CPU_CLK_UNHALTED.CORE_P    Core cycles when core is not halted.
    This event counts the number of core cycles while the core is not in a halt state.
    The core enters the halt state when it executes the HLT instruction. This event is a
    component in many key event ratios.
    In mobile systems the core frequency may change from time to time, so this event may
    have a changing ratio with respect to elapsed time. In systems with a constant core
    frequency, this event can give you a measurement of the elapsed time while the core
    was not in the halt state: divide the event count by the core frequency.
    - This is an architectural performance event.
    - The event CPU_CLK_UNHALTED.CORE_P is counted by a programmable counter.
    - The event CPU_CLK_UNHALTED.CORE is counted by a designated fixed counter, leaving
      the two programmable counters available for other events.
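The cycles-to-time conversion described above can be sketched in a few lines. This is an illustrative example, not part of the manual; the counter reading and the 1.6 GHz frequency below are hypothetical sample values, and the conversion is only valid when the core frequency is constant.

```python
# Convert a CPU_CLK_UNHALTED.CORE_P reading into elapsed non-halted time.
# Valid only for a constant core frequency, as the event description notes.

def unhalted_seconds(core_p_count: int, core_freq_hz: float) -> float:
    """Elapsed time (in seconds) the core spent outside the halt state."""
    return core_p_count / core_freq_hz

# Hypothetical sample: 3.2e9 unhalted core cycles on a core fixed at 1.6 GHz.
print(unhalted_seconds(3_200_000_000, 1.6e9))  # 2.0 seconds non-halted
```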
3CH    01H    CPU_CLK_UNHALTED.BUS    Bus cycles when core is not halted.
    This event counts the number of bus cycles while the core is not in the halt state.
    This event can give you a measurement of the elapsed time while the core was not in
    the halt state: divide the event count by the bus frequency. The core enters the
    halt state when it executes the HLT instruction.
    The event also has a constant ratio with the CPU_CLK_UNHALTED.REF event, which is
    the maximum bus-to-processor frequency ratio.
    Non-halted bus cycles are a component in many key event ratios.
3CH    02H    CPU_CLK_UNHALTED.NO_OTHER    Bus cycles when this core is active and the other is halted.
    This event counts the number of bus cycles during which this core remains non-halted
    and the other core on the processor is halted.
    This event can be used to determine the amount of parallelism exploited by an
    application or a system. Divide this event count by the bus frequency to determine
    the amount of time that only one core was in use.
40H    21H    L1D_CACHE.LD    L1 cacheable data reads.
    This event counts the number of data reads from cacheable memory.
40H    22H    L1D_CACHE.ST    L1 cacheable data writes.
    This event counts the number of data writes to cacheable memory.
60H    See Tables 18-71 and 18-72    BUS_REQUEST_OUTSTANDING    Outstanding cacheable data read bus requests duration.
    This event counts the number of pending full-cache-line read transactions on the bus
    occurring in each cycle. A read transaction is pending from the cycle it is sent on
    the bus until the full cache line is received by the processor.
    NOTE: This event is thread-independent and will not provide a count per logical
    processor when AnyThr is disabled.
61H    See Table 18-72    BUS_BNR_DRV    Number of Bus Not Ready signals asserted.
    This event counts the number of Bus Not Ready (BNR) signals that the processor
    asserts on the bus to suspend additional bus requests by other bus agents. A bus
    agent asserts the BNR signal when the number of data and snoop transactions is close
    to the maximum that the bus can handle.
    While this signal is asserted, new transactions cannot be submitted on the bus. As a
    result, transaction latency may have a higher impact on program performance.
    NOTE: This event is thread-independent and will not provide a count per logical
    processor when AnyThr is disabled.
62H    See Table 18-72    BUS_DRDY_CLOCKS    Bus cycles when data is sent on the bus.
    This event counts the number of bus cycles during which the DRDY (Data Ready) signal
    is asserted on the bus. The DRDY signal is asserted when data is sent on the bus.
    This event counts the number of bus cycles during which this agent (the processor)
    writes data on the bus back to memory or to other bus agents. This includes all
    explicit and implicit data writebacks, as well as partial writes.
    NOTE: This event is thread-independent and will not provide a count per logical
    processor when AnyThr is disabled.
63H    See Tables 18-71 and 18-72    BUS_LOCK_CLOCKS    Bus cycles when a LOCK signal is asserted.
    This event counts the number of bus cycles during which the LOCK signal is asserted
    on the bus. A LOCK signal is asserted for a locked memory access, due to:
    - Uncacheable memory.
    - A locked operation that spans two cache lines.
    - A page walk from an uncacheable page table.
    Bus locks have a very high performance penalty and it is highly recommended to avoid
    such accesses.
    NOTE: This event is thread-independent and will not provide a count per logical
    processor when AnyThr is disabled.
64H    See Table 18-71    BUS_DATA_RCV    Bus cycles while processor receives data.
    This event counts the number of cycles during which the processor is busy receiving
    data.
    NOTE: This event is thread-independent and will not provide a count per logical
    processor when AnyThr is disabled.
65H    See Tables 18-71 and 18-72    BUS_TRANS_BRD    Burst read bus transactions.
    This event counts the number of burst read transactions, including:
    - L1 data cache read misses (and L1 data cache hardware prefetches).
    - L2 hardware prefetches by the DPL and L2 streamer.
    - IFU read misses of cacheable lines.
    It does not include RFO transactions.
66H    See Tables 18-71 and 18-72    BUS_TRANS_RFO    RFO bus transactions.
    This event counts the number of Read For Ownership (RFO) bus transactions due to
    store operations that miss the L1 data cache and the L2 cache. This event also
    counts RFO bus transactions due to locked operations.
67H    See Tables 18-71 and 18-72    BUS_TRANS_WB    Explicit writeback bus transactions.
    This event counts all explicit writeback bus transactions due to dirty line
    evictions. It does not count implicit writebacks due to invalidation by a snoop
    request.
68H    See Tables 18-71 and 18-72    BUS_TRANS_IFETCH    Instruction-fetch bus transactions.
    This event counts all instruction fetch full cache line bus transactions.
69H    See Tables 18-71 and 18-72    BUS_TRANS_INVAL    Invalidate bus transactions.
    This event counts all invalidate transactions. Invalidate transactions are generated
    when:
    - A store operation hits a shared line in the L2 cache.
    - A full cache line write misses the L2 cache or hits a shared line in the L2 cache.
6AH    See Tables 18-71 and 18-72    BUS_TRANS_PWR    Partial write bus transactions.
    This event counts partial write bus transactions.
6BH    See Tables 18-71 and 18-72    BUS_TRANS_P    Partial bus transactions.
    This event counts all (read and write) partial bus transactions.
6CH    See Tables 18-71 and 18-72    BUS_TRANS_IO    IO bus transactions.
    This event counts the number of completed I/O bus transactions resulting from IN
    and OUT instructions. The count does not include memory-mapped IO.
6DH    See Tables 18-71 and 18-72    BUS_TRANS_DEF    Deferred bus transactions.
    This event counts the number of deferred transactions.
6EH    See Tables 18-71 and 18-72    BUS_TRANS_BURST    Burst (full cache-line) bus transactions.
    This event counts burst (full cache line) transactions, including:
    - Burst reads.
    - RFOs.
    - Explicit writebacks.
    - Write combine lines.
6FH    See Tables 18-71 and 18-72    BUS_TRANS_MEM    Memory bus transactions.
    This event counts all memory bus transactions, including:
    - Burst transactions.
    - Partial reads and writes.
    - Invalidate transactions.
    The BUS_TRANS_MEM count is the sum of BUS_TRANS_BURST, BUS_TRANS_P, and
    BUS_TRANS_INVAL.
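The decomposition of BUS_TRANS_MEM stated above can be written as a small sanity check. The three counter readings below are hypothetical sample values, not measured data.

```python
# BUS_TRANS_MEM = BUS_TRANS_BURST + BUS_TRANS_P + BUS_TRANS_INVAL,
# per the event description; useful as a consistency check on raw counts.

def mem_transactions(burst: int, partial: int, invalidate: int) -> int:
    """Expected BUS_TRANS_MEM value given its three component counts."""
    return burst + partial + invalidate

# Hypothetical readings: 9,000 burst + 700 partial + 300 invalidate transactions.
print(mem_transactions(9_000, 700, 300))  # 10000
```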
70H    See Tables 18-71 and 18-72    BUS_TRANS_ANY    All bus transactions.
    This event counts all bus transactions. This includes:
    - Memory transactions.
    - IO transactions (non memory-mapped).
    - Deferred transaction completion.
    - Other less frequent transactions, such as interrupts.
77H    See Tables 18-71 and 18-74    EXT_SNOOP    External snoops.
    This event counts the snoop responses to bus transactions. Responses can be counted
    separately by type and by bus agent.
    NOTE: This event is thread-independent and will not provide a count per logical
    processor when AnyThr is disabled.
7AH    See Table 18-72    BUS_HIT_DRV    HIT signal asserted.
    This event counts the number of bus cycles during which the processor drives the
    HIT# pin to signal a HIT snoop response.
    NOTE: This event is thread-independent and will not provide a count per logical
    processor when AnyThr is disabled.
7BH    See Table 18-72    BUS_HITM_DRV    HITM signal asserted.
    This event counts the number of bus cycles during which the processor drives the
    HITM# pin to signal a HITM snoop response.
    NOTE: This event is thread-independent and will not provide a count per logical
    processor when AnyThr is disabled.
7DH    See Table 18-71    BUSQ_EMPTY    Bus queue is empty.
    This event counts the number of cycles during which the core did not have any
    pending transactions in the bus queue.
    NOTE: This event is thread-independent and will not provide a count per logical
    processor when AnyThr is disabled.
7EH    See Tables 18-71 and 18-72    SNOOP_STALL_DRV    Bus stalled for snoops.
    This event counts the number of times that the bus snoop stall signal is asserted.
    During the snoop stall cycles, no new bus transactions requiring a snoop response
    can be initiated on the bus.
    NOTE: This event is thread-independent and will not provide a count per logical
    processor when AnyThr is disabled.
7FH    See Table 18-71    BUS_IO_WAIT    IO requests waiting in the bus queue.
    This event counts the number of core cycles during which IO requests wait in the
    bus queue. This event counts IO requests from the core.
80H    03H    ICACHE.ACCESSES    Instruction fetches.
    This event counts all instruction fetches, including uncacheable fetches.
80H    02H    ICACHE.MISSES    ICache misses.
    This event counts all instruction fetches that miss the instruction cache or
    produce memory requests, including uncacheable fetches. An instruction fetch miss
    is counted only once, not once for every cycle it is outstanding.
82H    04H    ITLB.FLUSH    ITLB flushes.
    This event counts the number of ITLB flushes.
82H    02H    ITLB.MISSES    ITLB misses.
    This event counts the number of instruction fetches that miss the ITLB.
AAH    02H    MACRO_INSTS.CISC_DECODED    CISC macro instructions decoded.
    This event counts the number of complex instructions decoded, but not necessarily
    executed or retired. Only one complex instruction can be decoded at a time.
AAH    03H    MACRO_INSTS.ALL_DECODED    All instructions decoded.
    This event counts the number of instructions decoded.
B0H    00H    SIMD_UOPS_EXEC.S    SIMD micro-ops executed (excluding stores).
    This event counts all the SIMD micro-ops executed. This event does not count MOVQ
    and MOVD stores from register to memory.
B0H    80H    SIMD_UOPS_EXEC.AR    SIMD micro-ops retired (excluding stores).
    This event counts the number of SIMD micro-ops retired. This event does not count
    MOVQ and MOVD stores from register to memory.
B1H    00H    SIMD_SAT_UOP_EXEC.S    SIMD saturated arithmetic micro-ops executed.
    This event counts the number of SIMD saturated arithmetic micro-ops executed.
B1H    80H    SIMD_SAT_UOP_EXEC.AR    SIMD saturated arithmetic micro-ops retired.
    This event counts the number of SIMD saturated arithmetic micro-ops retired.
B3H    01H    SIMD_UOP_TYPE_EXEC.MUL.S    SIMD packed multiply micro-ops executed.
    This event counts the number of SIMD packed multiply micro-ops executed.
B3H    81H    SIMD_UOP_TYPE_EXEC.MUL.AR    SIMD packed multiply micro-ops retired.
    This event counts the number of SIMD packed multiply micro-ops retired.
B3H    02H    SIMD_UOP_TYPE_EXEC.SHIFT.S    SIMD packed shift micro-ops executed.
    This event counts the number of SIMD packed shift micro-ops executed.
B3H    82H    SIMD_UOP_TYPE_EXEC.SHIFT.AR    SIMD packed shift micro-ops retired.
    This event counts the number of SIMD packed shift micro-ops retired.
B3H    04H    SIMD_UOP_TYPE_EXEC.PACK.S    SIMD pack micro-ops executed.
    This event counts the number of SIMD pack micro-ops executed.
B3H    84H    SIMD_UOP_TYPE_EXEC.PACK.AR    SIMD pack micro-ops retired.
    This event counts the number of SIMD pack micro-ops retired.
B3H    08H    SIMD_UOP_TYPE_EXEC.UNPACK.S    SIMD unpack micro-ops executed.
    This event counts the number of SIMD unpack micro-ops executed.
B3H    88H    SIMD_UOP_TYPE_EXEC.UNPACK.AR    SIMD unpack micro-ops retired.
    This event counts the number of SIMD unpack micro-ops retired.
B3H    10H    SIMD_UOP_TYPE_EXEC.LOGICAL.S    SIMD packed logical micro-ops executed.
    This event counts the number of SIMD packed logical micro-ops executed.
B3H    90H    SIMD_UOP_TYPE_EXEC.LOGICAL.AR    SIMD packed logical micro-ops retired.
    This event counts the number of SIMD packed logical micro-ops retired.
B3H    20H    SIMD_UOP_TYPE_EXEC.ARITHMETIC.S    SIMD packed arithmetic micro-ops executed.
    This event counts the number of SIMD packed arithmetic micro-ops executed.
B3H    A0H    SIMD_UOP_TYPE_EXEC.ARITHMETIC.AR    SIMD packed arithmetic micro-ops retired.
    This event counts the number of SIMD packed arithmetic micro-ops retired.
C0H    00H    INST_RETIRED.ANY_P    Instructions retired (precise event).
    This event counts the number of instructions that retire execution. For
    instructions that consist of multiple micro-ops, this event counts the retirement
    of the last micro-op of the instruction. The counter continues counting during
    hardware interrupts, traps, and inside interrupt handlers.
N/A    00H    INST_RETIRED.ANY    Instructions retired.
    This event counts the number of instructions that retire execution. For
    instructions that consist of multiple micro-ops, this event counts the retirement
    of the last micro-op of the instruction. The counter continues counting during
    hardware interrupts, traps, and inside interrupt handlers.
C2H 10H UOPS_RETIRED.ANY Micro-ops retired. This event counts the number of micro-ops retired. The
processor decodes complex macro instructions into a sequence
of simpler micro-ops. Most instructions are composed of one or
two micro-ops. Some instructions are decoded into longer
sequences such as repeat instructions, floating point
transcendental instructions, and assists. In some cases micro-op
sequences are fused or whole instructions are fused into one
micro-op. See other UOPS_RETIRED events for differentiating
retired fused and non-fused micro-ops.
C3H 01H MACHINE_CLEARS.SMC Self-Modifying Code This event counts the number of times that a program writes to
detected. a code section. Self-modifying code causes a severe penalty in
all Intel® architecture processors.
C4H    00H    BR_INST_RETIRED.ANY    Retired branch instructions.
    This event counts the number of branch instructions retired. This is an
    architectural performance event.
C4H    01H    BR_INST_RETIRED.PRED_NOT_TAKEN    Retired branch instructions that were predicted not-taken.
    This event counts the number of branch instructions retired that were correctly
    predicted to be not-taken.
C4H    02H    BR_INST_RETIRED.MISPRED_NOT_TAKEN    Retired branch instructions that were mispredicted not-taken.
    This event counts the number of branch instructions retired that were mispredicted
    and not-taken.
C4H    04H    BR_INST_RETIRED.PRED_TAKEN    Retired branch instructions that were predicted taken.
    This event counts the number of branch instructions retired that were correctly
    predicted to be taken.
C4H    08H    BR_INST_RETIRED.MISPRED_TAKEN    Retired branch instructions that were mispredicted taken.
    This event counts the number of branch instructions retired that were mispredicted
    and taken.
C4H    0AH    BR_INST_RETIRED.MISPRED    Retired mispredicted branch instructions (precise event).
    This event counts the number of retired branch instructions that were mispredicted
    by the processor. A branch misprediction occurs when the processor predicts that a
    branch will be taken but it is not, or vice versa. Mispredicted branches degrade
    performance because the processor starts executing instructions along the wrongly
    predicted path. When the misprediction is discovered, all the instructions executed
    on the wrong path must be discarded, and the processor must start again on the
    correct path.
    Using the Profile-Guided Optimization (PGO) features of the Intel® C++ compiler may
    help reduce branch mispredictions. See the compiler documentation for more
    information on this feature.
    To determine the branch misprediction ratio, divide the BR_INST_RETIRED.MISPRED
    event count by the BR_INST_RETIRED.ANY event count. To determine the number of
    mispredicted branches per instruction, divide the number of mispredicted branches
    by the INST_RETIRED.ANY event count. To measure the impact of branch
    mispredictions, use the event RESOURCE_STALLS.BR_MISS_CLEAR.
    Tips:
    - See the optimization guide for tips on reducing branch mispredictions.
    - PGO's purpose is to produce straight-line code for the most frequent execution
      paths, reducing taken branches and increasing the "basic block" size, possibly
      also reducing the code footprint or working set.
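The two ratios described above are straightforward divisions of raw counter readings. The sketch below uses hypothetical counts purely for illustration; the event names in the docstrings are the ones defined in this table.

```python
# Branch misprediction ratios from raw performance counter readings.

def mispred_ratio(mispred: int, branches: int) -> float:
    """BR_INST_RETIRED.MISPRED divided by BR_INST_RETIRED.ANY."""
    return mispred / branches

def mispred_per_inst(mispred: int, instructions: int) -> float:
    """Mispredicted branches per retired instruction (INST_RETIRED.ANY)."""
    return mispred / instructions

# Hypothetical readings: 5,000 mispredictions, 200,000 branches, 1,000,000 instructions.
print(mispred_ratio(5_000, 200_000))       # 0.025 -> 2.5% of branches mispredicted
print(mispred_per_inst(5_000, 1_000_000))  # 0.005 mispredictions per instruction
```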
C4H    0CH    BR_INST_RETIRED.TAKEN    Retired taken branch instructions.
    This event counts the number of branches retired that were taken.
C4H    0FH    BR_INST_RETIRED.ANY1    Retired branch instructions.
    This event counts the number of branch instructions retired. This event is a
    duplicate of BR_INST_RETIRED.ANY.
C5H    00H    BR_INST_RETIRED.MISPRED    Retired mispredicted branch instructions (precise event).
    This event counts the number of retired branch instructions that were mispredicted
    by the processor. A branch misprediction occurs when the processor predicts that a
    branch will be taken but it is not, or vice versa. Mispredicted branches degrade
    performance because the processor starts executing instructions along the wrongly
    predicted path. When the misprediction is discovered, all the instructions executed
    on the wrong path must be discarded, and the processor must start again on the
    correct path.
    Using the Profile-Guided Optimization (PGO) features of the Intel® C++ compiler may
    help reduce branch mispredictions. See the compiler documentation for more
    information on this feature.
    To determine the branch misprediction ratio, divide the BR_INST_RETIRED.MISPRED
    event count by the BR_INST_RETIRED.ANY event count. To determine the number of
    mispredicted branches per instruction, divide the number of mispredicted branches
    by the INST_RETIRED.ANY event count. To measure the impact of branch
    mispredictions, use the event RESOURCE_STALLS.BR_MISS_CLEAR.
    Tips:
    - See the optimization guide for tips on reducing branch mispredictions.
    - PGO's purpose is to produce straight-line code for the most frequent execution
      paths, reducing taken branches and increasing the "basic block" size, possibly
      also reducing the code footprint or working set.
C6H    01H    CYCLES_INT_MASKED.CYCLES_INT_MASKED    Cycles during which interrupts are disabled.
    This event counts the number of cycles during which interrupts are disabled.
C6H    02H    CYCLES_INT_MASKED.CYCLES_INT_PENDING_AND_MASKED    Cycles during which interrupts are pending and disabled.
    This event counts the number of cycles during which there are pending interrupts
    but interrupts are disabled.
C7H    01H    SIMD_INST_RETIRED.PACKED_SINGLE    Retired Streaming SIMD Extensions (SSE) packed-single instructions.
    This event counts the number of SSE packed-single instructions retired.
C7H    02H    SIMD_INST_RETIRED.SCALAR_SINGLE    Retired Streaming SIMD Extensions (SSE) scalar-single instructions.
    This event counts the number of SSE scalar-single instructions retired.
C7H    04H    SIMD_INST_RETIRED.PACKED_DOUBLE    Retired Streaming SIMD Extensions 2 (SSE2) packed-double instructions.
    This event counts the number of SSE2 packed-double instructions retired.
C7H    08H    SIMD_INST_RETIRED.SCALAR_DOUBLE    Retired Streaming SIMD Extensions 2 (SSE2) scalar-double instructions.
    This event counts the number of SSE2 scalar-double instructions retired.
C7H    10H    SIMD_INST_RETIRED.VECTOR    Retired Streaming SIMD Extensions 2 (SSE2) vector instructions.
    This event counts the number of SSE2 vector instructions retired.
C7H    1FH    SIMD_INST_RETIRED.ANY    Retired Streaming SIMD instructions.
    This event counts the overall number of SIMD instructions retired. To count each
    type of SIMD instruction separately, use the following events:
    SIMD_INST_RETIRED.PACKED_SINGLE, SIMD_INST_RETIRED.SCALAR_SINGLE,
    SIMD_INST_RETIRED.PACKED_DOUBLE, SIMD_INST_RETIRED.SCALAR_DOUBLE, and
    SIMD_INST_RETIRED.VECTOR.
C8H 00H HW_INT_RCV Hardware interrupts This event counts the number of hardware interrupts received
received. by the processor. This event will count twice for dual-pipe
micro-ops.
CAH    01H    SIMD_COMP_INST_RETIRED.PACKED_SINGLE    Retired computational Streaming SIMD Extensions (SSE) packed-single instructions.
    This event counts the number of computational SSE packed-single instructions
    retired. Computational instructions perform arithmetic computations, like add,
    multiply, and divide. Instructions that perform load and store operations or
    logical operations, like XOR, OR, and AND, are not counted by this event.
CAH    02H    SIMD_COMP_INST_RETIRED.SCALAR_SINGLE    Retired computational Streaming SIMD Extensions (SSE) scalar-single instructions.
    This event counts the number of computational SSE scalar-single instructions
    retired. Computational instructions perform arithmetic computations, like add,
    multiply, and divide. Instructions that perform load and store operations or
    logical operations, like XOR, OR, and AND, are not counted by this event.
CAH    04H    SIMD_COMP_INST_RETIRED.PACKED_DOUBLE    Retired computational Streaming SIMD Extensions 2 (SSE2) packed-double instructions.
    This event counts the number of computational SSE2 packed-double instructions
    retired. Computational instructions perform arithmetic computations, like add,
    multiply, and divide. Instructions that perform load and store operations or
    logical operations, like XOR, OR, and AND, are not counted by this event.
CAH    08H    SIMD_COMP_INST_RETIRED.SCALAR_DOUBLE    Retired computational Streaming SIMD Extensions 2 (SSE2) scalar-double instructions.
    This event counts the number of computational SSE2 scalar-double instructions
    retired. Computational instructions perform arithmetic computations, like add,
    multiply, and divide. Instructions that perform load and store operations or
    logical operations, like XOR, OR, and AND, are not counted by this event.
CBH    01H    MEM_LOAD_RETIRED.L2_HIT    Retired loads that hit the L2 cache (precise event).
    This event counts the number of retired load operations that missed the L1 data
    cache and hit the L2 cache.
CBH    02H    MEM_LOAD_RETIRED.L2_MISS    Retired loads that miss the L2 cache (precise event).
    This event counts the number of retired load operations that missed the L2 cache.
CBH    04H    MEM_LOAD_RETIRED.DTLB_MISS    Retired loads that miss the DTLB (precise event).
    This event counts the number of retired loads that missed the DTLB. The DTLB miss
    is not counted if the load operation causes a fault.
CDH    00H    SIMD_ASSIST    SIMD assists invoked.
    This event counts the number of SIMD assists invoked. SIMD assists are invoked when
    an EMMS instruction is executed after MMX™ technology code has changed the MMX
    state in the floating point stack. For Streaming SIMD Extensions (SSE)
    instructions, these assists are required in the following cases:
    1. Denormal input when the DAZ (Denormals Are Zeros) flag is off.
    2. Underflow result when the FTZ (Flush To Zero) flag is off.
CEH    00H    SIMD_INSTR_RETIRED    SIMD instructions retired.
    This event counts the number of SIMD instructions that retired.
CFH    00H    SIMD_SAT_INSTR_RETIRED    Saturated arithmetic instructions retired.
    This event counts the number of saturated arithmetic SIMD instructions that
    retired.
E0H    01H    BR_INST_DECODED    Branch instructions decoded.
    This event counts the number of branch instructions decoded.
E4H    01H    BOGUS_BR    Bogus branches.
    This event counts the number of byte sequences that were mistakenly detected as
    taken branch instructions. This results in a BACLEAR event and a flush of the BTB.
    This occurs mainly after task switches.
E6H    01H    BACLEARS.ANY    BACLEARS asserted.
    This event counts the number of times the front end is redirected as a result of a
    branch prediction, mainly when an early branch prediction is corrected by other
    branch-handling mechanisms in the front end. This can occur if the code has many
    branches such that they cannot all be consumed by the branch predictor. Each
    BACLEAR asserted costs approximately 7 cycles; the effect on total execution time
    depends on the surrounding code.
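The approximate 7-cycle cost per BACLEAR quoted above allows a rough first-order estimate of front-end redirection overhead. This is an illustrative sketch only; the counter readings are hypothetical, and the true cost depends on the surrounding code, as the description notes.

```python
# Rough estimate of cycles lost to BACLEARs, using the ~7-cycle figure from
# the BACLEARS.ANY event description. Inputs are hypothetical counter readings.

BACLEAR_COST_CYCLES = 7  # approximate cost per asserted BACLEAR

def baclear_cycle_fraction(baclears: int, unhalted_cycles: int) -> float:
    """Approximate fraction of non-halted cycles spent on BACLEAR penalties."""
    return (baclears * BACLEAR_COST_CYCLES) / unhalted_cycles

# Hypothetical sample: 1M BACLEARs over 700M unhalted core cycles.
print(baclear_cycle_fraction(1_000_000, 700_000_000))  # 0.01 -> ~1% of cycles
```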
Table 19-32. Performance Events in Intel® Core™ Solo and Intel® Core™ Duo Processors
Event Num.    Event Mask Mnemonic    Umask Value    Description    Comment
03H LD_Blocks 00H Load operations delayed due to store buffer blocks.
The preceding store may be blocked due to
unknown address, unknown data, or conflict due to
partial overlap between the load and store.
04H SD_Drains 00H Cycles while draining store buffers.
05H Misalign_Mem_Ref 00H Misaligned data memory references (MOB splits of
loads and stores).
06H Seg_Reg_Loads 00H Segment register loads.
07H SSE_PrefNta_Ret 00H SSE software prefetch instruction PREFETCHNTA
retired.
07H SSE_PrefT1_Ret 01H SSE software prefetch instruction PREFETCHT1
retired.
07H SSE_PrefT2_Ret 02H SSE software prefetch instruction PREFETCHT2
retired.
07H SSE_NTStores_Ret 03H SSE streaming store instruction retired.
10H    FP_Comps_Op_Exe    00H    FP computational instructions executed. FADD, FSUB, FCOM, FMULs,
    MUL, IMUL, FDIVs, DIV, IDIV, FPREMs, and FSQRT are included; FADD or FMUL used in
    the middle of a transcendental instruction are excluded.
11H FP_Assist 00H FP exceptions experienced microcode assists. IA32_PMC1 only.
12H Mul 00H Multiply operations (a speculative count, including IA32_PMC1 only.
FP and integer multiplies).
13H Div 00H Divide operations (a speculative count, including FP IA32_PMC1 only.
and integer divisions).
14H Cycles_Div_Busy 00H Cycles the divider is busy. IA32_PMC0 only.
21H    L2_ADS    00H    L2 address strobes.    Requires core specificity.
22H    Dbus_Busy    00H    Core cycles during which the data bus was busy (increments by 4).
    Requires core specificity.
23H    Dbus_Busy_Rd    00H    Cycles the data bus is busy transferring data to a core (increments
    by 4).    Requires core specificity.
24H    L2_Lines_In    00H    L2 cache lines allocated.    Requires core specificity and HW prefetch
    qualification.
25H    L2_M_Lines_In    00H    L2 Modified-state cache lines allocated.    Requires core specificity.
26H    L2_Lines_Out    00H    L2 cache lines evicted.    Requires core specificity and HW prefetch
    qualification.
27H    L2_M_Lines_Out    00H    L2 Modified-state cache lines evicted.    Requires core specificity
    and HW prefetch qualification.
28H    L2_IFetch    Requires MESI qualification.    L2 instruction fetches from the instruction fetch
    unit (includes speculative fetches).    Requires core specificity.
29H    L2_LD    Requires MESI qualification.    L2 cache reads.    Requires core specificity.
2AH    L2_ST    Requires MESI qualification.    L2 cache writes (includes speculation).    Requires
    core specificity.
2EH    L2_Rqsts    Requires MESI qualification.    L2 cache reference requests.    Requires core
    specificity and HW prefetch qualification.
30H    L2_Reject_Cycles    Requires MESI qualification.    Cycles L2 is busy and rejecting new
    requests.    Requires core specificity and HW prefetch qualification.
32H    L2_No_Request_Cycles    Requires MESI qualification.    Cycles there is no request to access
    L2.
3AH EST_Trans_All 00H Any Intel Enhanced SpeedStep(R) Technology
transitions.
3AH EST_Trans_All 10H Intel Enhanced SpeedStep Technology frequency
transitions.
3BH Thermal_Trip C0H Duration in a thermal trip based on the current core Use edge trigger to
clock. count occurrence.
3CH NonHlt_Ref_Cycles 01H Non-halted bus cycles.
3CH Serial_Execution_ 02H Non-halted bus cycles of this core executing code
Cycles while the other core is halted.
40H    DCache_Cache_LD    Requires MESI qualification.    L1 cacheable data read operations.
41H    DCache_Cache_ST    Requires MESI qualification.    L1 cacheable data write operations.
42H    DCache_Cache_Lock    Requires MESI qualification.    L1 cacheable lock read operations to
    invalid state.
43H    Data_Mem_Ref    01H    L1 data reads and writes of cacheable and non-cacheable types.
44H    Data_Mem_Cache_Ref    02H    L1 data cacheable read and write operations.
45H DCache_Repl 0FH L1 data cache line replacements.
46H DCache_M_Repl 00H L1 data M-state cache line allocated.
47H DCache_M_Evict 00H L1 data M-state cache line evicted.
48H DCache_Pend_Miss 00H Weighted cycles of L1 miss outstanding. Use Cmask =1 to count
duration.
49H Dtlb_Miss 00H Data references that missed TLB.
4BH SSE_PrefNta_Miss 00H PREFETCHNTA missed all caches.
4BH SSE_PrefT1_Miss 01H PREFETCHT1 missed all caches.
4BH SSE_PrefT2_Miss 02H PREFETCHT2 missed all caches.
4BH    SSE_NTStores_Miss    03H    SSE streaming store instruction missed all caches.
4FH L1_Pref_Req 00H L1 prefetch requests due to DCU cache misses. May overcount if
request re-submitted.
60H Bus_Req_ 00; Requires core- Weighted cycles of cacheable bus data read Use Cmask =1 to count
Outstanding specificity, and agent requests. This event counts full-line read request duration.
specificity from DCU or HW prefetcher, but not RFO, write, Use Umask bit 12 to
instruction fetches, or others. include HWP or exclude
HWP separately.
61H Bus_BNR_Clocks 00H External bus cycles while BNR asserted.
62H Bus_DRDY_Clocks 00H External bus cycles while DRDY asserted. Requires agent
specificity.
63H Bus_Locks_Clocks 00H External bus cycles while bus lock signal asserted. Requires core
specificity.
64H Bus_Data_Rcv 40H Number of data chunks received by this processor.
65H Bus_Trans_Brd See comment. Burst read bus transactions (data or code). Requires core specificity.
66H Bus_Trans_RFO See comment. Completed read for ownership (RFO) transactions. Comment for 66H-6CH: requires agent specificity and core specificity; each transaction counts its address strobe; retried transactions may be counted more than once.
68H Bus_Trans_Ifetch See comment. Completed instruction fetch transactions.
69H Bus_Trans_Inval See comment. Completed invalidate transactions.
6AH Bus_Trans_Pwr See comment. Completed partial write transactions.
6BH Bus_Trans_P See comment. Completed partial transactions (includes partial read + partial write + line write).
6CH Bus_Trans_IO See comment. Completed I/O transactions (read and write).
6DH Bus_Trans_Def 20H Completed defer transactions. Requires core specificity. Retried transactions may be counted more than once.
67H Bus_Trans_WB C0H Completed writeback transactions from DCU (does not include L2 writebacks). Requires agent specificity.
6EH Bus_Trans_Burst C0H Completed burst transactions (full-line transactions include reads, write, RFO, and writebacks). Each transaction counts its address strobe.
6FH Bus_Trans_Mem C0H Completed memory transactions. This includes Bus_Trans_Burst + Bus_Trans_P + Bus_Trans_Inval. Retried transactions may be counted more than once.
70H Bus_Trans_Any C0H Any completed bus transactions.
77H Bus_Snoops 00H Counts any snoop on the bus. Requires MESI qualification. Requires agent specificity.
78H DCU_Snoop_To_Share 01H DCU snoops to share-state L1 cache line due to L1 misses. Requires core specificity.
7DH Bus_Not_In_Use 00H Number of cycles there is no transaction from the core. Requires core specificity.
7EH Bus_Snoop_Stall 00H Number of bus cycles while bus snoop is stalled.
80H ICache_Reads 00H Number of instruction fetches from ICache, streaming buffers (both cacheable and uncacheable fetches).
81H ICache_Misses 00H Number of instruction fetch misses from ICache, streaming buffers.
85H ITLB_Misses 00H Number of ITLB misses.
86H IFU_Mem_Stall 00H Cycles IFU is stalled while waiting for data from memory.
87H ILD_Stall 00H Number of instruction length decoder stalls (counts number of LCP stalls).
88H Br_Inst_Exec 00H Branch instructions executed (includes speculation).
89H Br_Missp_Exec 00H Branch instructions executed and mispredicted at execution (includes branches that do not have a prediction or are mispredicted).
8AH Br_BAC_Missp_Exec 00H Branch instructions executed that were mispredicted at the front end.
8BH Br_Cnd_Exec 00H Conditional branch instructions executed.
8CH Br_Cnd_Missp_Exec 00H Conditional branch instructions executed that were mispredicted.
8DH Br_Ind_Exec 00H Indirect branch instructions executed.
8EH Br_Ind_Missp_Exec 00H Indirect branch instructions executed that were mispredicted.
8FH Br_Ret_Exec 00H Return branch instructions executed.
90H Br_Ret_Missp_Exec 00H Return branch instructions executed that were mispredicted.
91H Br_Ret_BAC_Missp_Exec 00H Return branch instructions executed that were mispredicted at the front end.
92H Br_Call_Exec 00H Return call instructions executed.
93H Br_Call_Missp_Exec 00H Return call instructions executed that were mispredicted.
94H Br_Ind_Call_Exec 00H Indirect call branch instructions executed.
A2H Resource_Stall 00H Cycles while there is a resource related stall (renaming, buffer entries) as seen by allocator.
B0H MMX_Instr_Exec 00H Number of MMX instructions executed (does not include MOVQ and MOVD stores).
B1H SIMD_Int_Sat_Exec 00H Number of SIMD Integer saturating instructions executed.
B3H SIMD_Int_Pmul_Exec 01H Number of SIMD Integer packed multiply instructions executed.
B3H SIMD_Int_Psft_Exec 02H Number of SIMD Integer packed shift instructions executed.
B3H SIMD_Int_Pck_Exec 04H Number of SIMD Integer pack operation instructions executed.
B3H SIMD_Int_Upck_Exec 08H Number of SIMD Integer unpack instructions executed.
B3H SIMD_Int_Plog_Exec 10H Number of SIMD Integer packed logical instructions executed.
B3H SIMD_Int_Pari_Exec 20H Number of SIMD Integer packed arithmetic instructions executed.
C0H Instr_Ret 00H Number of instructions retired (a macro-fused instruction counts as 2).
C1H FP_Comp_Instr_Ret 00H Number of FP compute instructions retired (X87 instructions or instructions that contain X87 operations). Use IA32_PMC0 only.
C2H Uops_Ret 00H Number of micro-ops retired (includes fused uops).
C3H SMC_Detected 00H Number of times a self-modifying code condition was detected.
C4H Br_Instr_Ret 00H Number of branch instructions retired.
C5H Br_MisPred_Ret 00H Number of mispredicted branch instructions retired.
C6H Cycles_Int_Masked 00H Cycles while interrupts are disabled.
C7H Cycles_Int_Pending_Masked 00H Cycles while interrupts are disabled and interrupts are pending.
C8H HW_Int_Rx 00H Number of hardware interrupts received.
C9H Br_Taken_Ret 00H Number of taken branch instructions retired.
CAH Br_MisPred_Taken_Ret 00H Number of taken and mispredicted branch instructions retired.
CCH MMX_FP_Trans 00H Number of transitions from MMX to X87.
CCH FP_MMX_Trans 01H Number of transitions from X87 to MMX.
CDH MMX_Assist 00H Number of EMMS executed.
CEH MMX_Instr_Ret 00H Number of MMX instructions retired.
D0H Instr_Decoded 00H Number of instructions decoded.
D7H ESP_Uops 00H Number of ESP folding instructions decoded.
D8H SIMD_FP_SP_Ret 00H Number of SSE/SSE2 single precision instructions retired (packed and scalar).
D8H SIMD_FP_SP_S_Ret 01H Number of SSE/SSE2 scalar single precision instructions retired.
D8H SIMD_FP_DP_P_Ret 02H Number of SSE/SSE2 packed double precision instructions retired.
D8H SIMD_FP_DP_S_Ret 03H Number of SSE/SSE2 scalar double precision instructions retired.
D8H SIMD_Int_128_Ret 04H Number of SSE2 128-bit integer instructions retired.
D9H SIMD_FP_SP_P_Comp_Ret 00H Number of SSE/SSE2 packed single precision compute instructions retired (does not include AND, OR, XOR).
D9H SIMD_FP_SP_S_Comp_Ret 01H Number of SSE/SSE2 scalar single precision compute instructions retired (does not include AND, OR, XOR).
D9H SIMD_FP_DP_P_Comp_Ret 02H Number of SSE/SSE2 packed double precision compute instructions retired (does not include AND, OR, XOR).
D9H SIMD_FP_DP_S_Comp_Ret 03H Number of SSE/SSE2 scalar double precision compute instructions retired (does not include AND, OR, XOR).
DAH Fused_Uops_Ret 00H All fused uops retired.
DAH Fused_Ld_Uops_Ret 01H Fused load uops retired.
DAH Fused_St_Uops_Ret 02H Fused store uops retired.
DBH Unfusion 00H Number of unfusion events in the ROB (due to exception).
E0H Br_Instr_Decoded 00H Branch instructions decoded.
E2H BTB_Misses 00H Number of branches for which the BTB did not produce a prediction.
E4H Br_Bogus 00H Number of bogus branches.
E6H BAClears 00H Number of BAClears asserted.
F0H Pref_Rqsts_Up 00H Number of hardware prefetch requests issued in forward streams.
F8H Pref_Rqsts_Dn 00H Number of hardware prefetch requests issued in backward streams.
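Several Table 19-32 events call for specific unit-mask or Cmask settings (for example, Cmask = 1 on Bus_Req_Outstanding to count request duration). As an illustrative sketch only, assuming the IA32_PERFEVTSELx field layout defined in the SDM's performance monitoring chapter (the helper name is hypothetical), the select value can be composed like this:

```c
#include <stdint.h>

/* Sketch: compose an IA32_PERFEVTSELx value, assuming the layout
   [7:0] event select, [15:8] unit mask, bit 16 USR, bit 17 OS,
   bit 22 EN, [31:24] counter mask (Cmask). The value would then
   be written with WRMSR at ring 0 (not shown). */
static uint64_t perfevtsel(uint8_t event, uint8_t umask,
                           int usr, int os, uint8_t cmask)
{
    uint64_t v = (uint64_t)event | ((uint64_t)umask << 8);
    if (usr) v |= 1ULL << 16;    /* count when CPL > 0 */
    if (os)  v |= 1ULL << 17;    /* count when CPL = 0 */
    v |= 1ULL << 22;             /* EN: enable counting */
    v |= (uint64_t)cmask << 24;  /* e.g. Cmask = 1 for Bus_Req_Outstanding */
    return v;
}
```

For example, perfevtsel(0xC5, 0x00, 1, 1, 0) selects Br_MisPred_Ret in both user and kernel mode, and perfevtsel(0x60, 0x00, 1, 1, 1) selects Bus_Req_Outstanding with Cmask = 1.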
page_walk_type This event counts various types of page walks that the page miss handler (PMH) performs.
ESCR restrictions: MSR_PMH_ESCR0, MSR_PMH_ESCR1
Counter numbers per ESCR: ESCR0: 0, 1; ESCR1: 2, 3
ESCR Event Select: 01H (ESCR[31:25])
ESCR Event Mask (ESCR[24:9]):
Bit 0: DTMISS: Page walk for a data TLB miss (either load or store).
Bit 1: ITMISS: Page walk for an instruction TLB miss.
CCCR Select: 04H (CCCR[15:13])
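The ESCR and CCCR field positions quoted above (event select in ESCR[31:25], event mask in ESCR[24:9], ESCR select in CCCR[15:13]) can be turned into register values mechanically. The helper names below are hypothetical, and other control bits (tag enable, CCCR enable, privilege qualifiers) are deliberately omitted from this sketch:

```c
#include <stdint.h>

/* page_walk_type event mask bits, per the description above. */
#define DTMISS (1u << 0)   /* page walk for a data TLB miss        */
#define ITMISS (1u << 1)   /* page walk for an instruction TLB miss */

/* Place the event select in ESCR[31:25] and the mask in ESCR[24:9]. */
static uint32_t escr_value(uint32_t event_select, uint32_t event_mask)
{
    return (event_select << 25) | (event_mask << 9);
}

/* Place the ESCR select in CCCR[15:13]; other CCCR bits omitted. */
static uint32_t cccr_escr_select(uint32_t escr_select)
{
    return escr_select << 13;
}
```

With event select 01H and both mask bits, escr_value(0x01, DTMISS | ITMISS) counts all page walks, and cccr_escr_select(0x04) supplies the CCCR select field.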
BSQ_cache_reference This event counts cache references (2nd level cache or 3rd level cache) as seen by the bus unit.
Specify one or more mask bits to select an access according to the access type (read type includes both load and RFO, write type includes writebacks and evictions) and the access result (hit, misses).
ESCR restrictions: MSR_BSU_ESCR0, MSR_BSU_ESCR1
Counter numbers per ESCR: ESCR0: 0, 1; ESCR1: 2, 3
ESCR Event Select: 0CH (ESCR[31:25])
ESCR Event Mask:
Bit 1: RD_2ndL_HITE: Read 2nd level cache hit Exclusive (includes load and RFO).
Bit 2: RD_2ndL_HITM: Read 2nd level cache hit Modified (includes load and RFO).
Bit 3: RD_3rdL_HITS: Read 3rd level cache hit Shared (includes load and RFO).
Bit 4: RD_3rdL_HITE: Read 3rd level cache hit Exclusive (includes load and RFO).
Bit 5: RD_3rdL_HITM: Read 3rd level cache hit Modified (includes load and RFO).
Bit 8: RD_2ndL_MISS: Read 2nd level cache miss (includes load and RFO).
Bit 9: RD_3rdL_MISS: Read 3rd level cache miss (includes load and RFO).
Bit 10: WR_2ndL_MISS: A writeback lookup from DAC misses the 2nd level cache (unlikely to happen).
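A common use of these mask bits is to derive a 2nd-level read miss ratio by programming one counter with the RD_2ndL hit masks and another with RD_2ndL_MISS. The helper below is a hypothetical post-processing step on the collected counts, not part of the event definition itself:

```c
#include <stdint.h>

/* Sketch: 2nd-level read miss ratio from BSQ_cache_reference counts,
   assuming counter A was programmed with RD_2ndL_HITS/HITE/HITM and
   counter B with RD_2ndL_MISS. Returns 0.0 when no references occurred. */
static double l2_read_miss_ratio(uint64_t rd_2ndl_hits, uint64_t rd_2ndl_miss)
{
    uint64_t total = rd_2ndl_hits + rd_2ndl_miss;
    return total ? (double)rd_2ndl_miss / (double)total : 0.0;
}
```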
Table 19-36. List of Metrics Available for Front_end Tagging (For Front_end Event Only)
Front-end metric1 / MSR_TC_PRECISE_EVENT MSR Bit field / Additional MSR / Event mask value for Front_end_event
memory_loads None Set TAGLOADS bit in ESCR corresponding to event Uop_Type. NBOGUS
memory_stores None Set TAGSTORES bit in the ESCR corresponding to event Uop_Type. NBOGUS
NOTES:
1. There may be some undercounting of front end events when there is an overflow or underflow of the floating point stack.
Table 19-37. List of Metrics Available for Execution Tagging (For Execution Event Only)
Execution metric / Upstream ESCR / TagValue in Upstream ESCR / Event mask value for execution_event
packed_SP_retired Set ALL bit in event mask, TagUop bit in ESCR of packed_SP_uop. 1 NBOGUS0
packed_DP_retired Set ALL bit in event mask, TagUop bit in ESCR of packed_DP_uop. 1 NBOGUS0
scalar_SP_retired Set ALL bit in event mask, TagUop bit in ESCR of scalar_SP_uop. 1 NBOGUS0
scalar_DP_retired Set ALL bit in event mask, TagUop bit in ESCR of scalar_DP_uop. 1 NBOGUS0
128_bit_MMX_retired Set ALL bit in event mask, TagUop bit in ESCR of 128_bit_MMX_uop. 1 NBOGUS0
64_bit_MMX_retired Set ALL bit in event mask, TagUop bit in ESCR of 64_bit_MMX_uop. 1 NBOGUS0
X87_FP_retired Set ALL bit in event mask, TagUop bit in ESCR of x87_FP_uop. 1 NBOGUS0
X87_SIMD_memory_moves_retired Set ALLP0, ALLP2 bits in event mask, TagUop bit in ESCR of X87_SIMD_moves_uop. 1 NBOGUS0
Table 19-38. List of Metrics Available for Replay Tagging (For Replay Event Only)
Replay metric1 / IA32_PEBS_ENABLE Field to Set / MSR_PEBS_MATRIX_VERT Bit to Set / Additional MSR/Event / Event Mask Value for Replay_event
1stL_cache_load_miss_retired Bit 0, Bit 24, Bit 25 Bit 0 None NBOGUS
2ndL_cache_load_miss_retired2 Bit 1, Bit 24, Bit 25 Bit 0 None NBOGUS
DTLB_load_miss_retired Bit 2, Bit 24, Bit 25 Bit 0 None NBOGUS
DTLB_store_miss_retired Bit 2, Bit 24, Bit 25 Bit 1 None NBOGUS
DTLB_all_miss_retired Bit 2, Bit 24, Bit 25 Bit 0, Bit 1 None NBOGUS
Tagged_mispred_branch Bit 15, Bit 16, Bit 24, Bit 25 Bit 4 None NBOGUS
MOB_load_replay_retired3 Bit 9, Bit 24, Bit 25 Bit 0 Select MOB_load_replay event and set PARTIAL_DATA and UNALGN_ADDR bits. NBOGUS
split_load_retired Bit 10, Bit 24, Bit 25 Bit 0 Select load_port_replay event with the MSR_SAAT_ESCR1 MSR and set the SPLIT_LD mask bit. NBOGUS
split_store_retired Bit 10, Bit 24, Bit 25 Bit 1 Select store_port_replay event with the MSR_SAAT_ESCR0 MSR and set the SPLIT_ST mask bit. NBOGUS
NOTES:
1. Certain kinds of μops cannot be tagged. These include I/O operations, UC and locked accesses, returns, and far transfers.
2. 2nd-level misses retired does not count all 2nd-level misses. It only includes those references that are found to be misses by the fast detection logic and not those that are later found to be misses.
3. While there are several causes for a MOB replay, the event counted with this event mask setting is the case where the data from a load that would otherwise be forwarded is not an aligned subset of the data from a preceding store.
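Table 19-38 reduces each replay metric to a set of bit positions in IA32_PEBS_ENABLE and MSR_PEBS_MATRIX_VERT. As a hedged sketch (the helper names are hypothetical, and the WRMSR that applies the values requires ring 0 and is not shown), the metric rows translate to register values like this:

```c
#include <stdint.h>

/* OR together a list of bit positions into a single MSR value. */
static uint64_t set_bits(const int *pos, int n)
{
    uint64_t v = 0;
    for (int i = 0; i < n; i++)
        v |= 1ULL << pos[i];
    return v;
}

/* 1stL_cache_load_miss_retired per Table 19-38: IA32_PEBS_ENABLE
   bits 0, 24, 25; MSR_PEBS_MATRIX_VERT bit 0; replay_event mask NBOGUS. */
static uint64_t pebs_enable_1stL_miss(void)
{
    const int pos[] = {0, 24, 25};
    return set_bits(pos, 3);
}
```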
1: REQ_TYPE1 TS
2: REQ_LEN0 TS
3: REQ_LEN1 TS
5: REQ_IO_TYPE TS
6: REQ_LOCK_TYPE TS
7: REQ_CACHE_TYPE TS
8: REQ_SPLIT_TYPE TS
9: REQ_DEM_TYPE TS
10: REQ_ORD_TYPE TS
11: MEM_TYPE0 TS
12: MEM_TYPE1 TS
13: MEM_TYPE2 TS
Non-Retirement BSQ_cache_reference Bit
0: RD_2ndL_HITS TS
1: RD_2ndL_HITE TS
2: RD_2ndL_HITM TS
3: RD_3rdL_HITS TS
4: RD_3rdL_HITE TS
5: RD_3rdL_HITM TS
6: WR_2ndL_HIT TS
7: WR_3rdL_HIT TS
8: RD_2ndL_MISS TS
9: RD_3rdL_MISS TS
10: WR_2ndL_MISS TS
11: WR_3rdL_MISS TS
Non-Retirement memory_cancel Bit
2: ST_RB_FULL TS
3: 64K_CONF TS
Non-Retirement SSE_input_assist Bit 15: ALL TI
Non-Retirement 64bit_MMX_uop Bit 15: ALL TI
Non-Retirement packed_DP_uop Bit 15: ALL TI
Non-Retirement packed_SP_uop Bit 15: ALL TI
Non-Retirement scalar_DP_uop Bit 15: ALL TI
Non-Retirement scalar_SP_uop Bit 15: ALL TI
Non-Retirement 128bit_MMX_uop Bit 15: ALL TI
Non-Retirement x87_FP_uop Bit 15: ALL TI
4: ALLP2 TI
Non-Retirement FSB_data_activity Bit
0: DRDY_DRV TI
1: DRDY_OWN TI
2: DRDY_OTHER TI
3: DBSY_DRV TI
4: DBSY_OWN TI
5: DBSY_OTHER TI
Non-Retirement IOQ_allocation Bit
0: ReqA0 TS
1: ReqA1 TS
2: ReqA2 TS
3: ReqA3 TS
4: ReqA4 TS
5: ALL_READ TS
6: ALL_WRITE TS
7: MEM_UC TS
8: MEM_WC TS
9: MEM_WT TS
10: MEM_WP TS
11: MEM_WB TS
13: OWN TS
14: OTHER TS
15: PREFETCH TS
Non-Retirement IOQ_active_entries Bit
0: ReqA0 TS
1: ReqA1 TS
2: ReqA2 TS
3: ReqA3 TS
4: ReqA4 TS
5: ALL_READ TS
6: ALL_WRITE TS
7: MEM_UC TS
8: MEM_WC TS
9: MEM_WT TS
10: MEM_WP TS
11: MEM_WB TS
1: MISS TS
2: HIT_UC TS
Non-Retirement MOB_load_replay Bit
1: NO_STA TS
3: NO_STD TS
4: PARTIAL_DATA TS
5: UNALGN_ADDR TS
Non-Retirement page_walk_type Bit
0: DTMISS TI
1: ITMISS TI
Non-Retirement uop_type Bit
1: TAGLOADS TS
2: TAGSTORES TS
Non-Retirement load_port_replay Bit 1: SPLIT_LD TS
Non-Retirement store_port_replay Bit 1: SPLIT_ST TS
Non-Retirement memory_complete Bit
0: LSC TS
1: SSC TS
2: USC TS
3: ULC TS
Non-Retirement retired_mispred_branch_type Bit
0: UNCONDITIONAL TS
1: CONDITIONAL TS
2: CALL TS
3: RETURN TS
4: INDIRECT TS
Non-Retirement retired_branch_type Bit
0: UNCONDITIONAL TS
1: CONDITIONAL TS
2: CALL TS
3: RETURN TS
4: INDIRECT TS
1: DB TI
2: DI TI
3: BD TI
4: BB TI
5: BI TI
6: ID TI
7: IB TI
Non-Retirement uop_queue_writes Bit
0: FROM_TC_BUILD TS
1: FROM_TC_DELIVER TS
2: FROM_ROM TS
Non-Retirement resource_stall Bit 5: SBFULL TS
Non-Retirement WC_Buffer Bit
0: WCB_EVICTS TI
1: WCB_FULL_EVICT TI
2: WCB_HITM_EVICT TI
At Retirement instr_retired Bit
0: NBOGUSNTAG TS
1: NBOGUSTAG TS
2: BOGUSNTAG TS
3: BOGUSTAG TS
At Retirement machine_clear Bit
0: CLEAR TS
2: MOCLEAR TS
6: SMCCLEAR TS
At Retirement front_end_event Bit
0: NBOGUS TS
1: BOGUS TS
At Retirement replay_event Bit
0: NBOGUS TS
1: BOGUS TS
At Retirement execution_event Bit
0: NONBOGUS0 TS
1: NONBOGUS1 TS
1: FPSO TS
2: POAO TS
3: POAU TS
4: PREA TS
At Retirement branch_retired Bit
0: MMNP TS
1: MMNM TS
2: MMTP TS
3: MMTM TS
At Retirement mispred_branch_retired Bit 0: NBOGUS TS
At Retirement uops_retired Bit
0: NBOGUS TS
1: BOGUS TS
At Retirement instr_completed Bit
0: NBOGUS TS
1: BOGUS TS
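The NBOGUS and BOGUS masks above separate work on the correct path from work on misspeculated (bogus) paths. As an illustrative post-processing sketch, assuming two counters programmed with the uops_retired NBOGUS and BOGUS masks respectively (the helper name is hypothetical):

```c
#include <stdint.h>

/* Sketch: fraction of uops on bogus (misspeculated) paths, a rough
   speculation-waste figure derived from uops_retired NBOGUS/BOGUS
   counts. Returns 0.0 when no uops were counted. */
static double bogus_uop_fraction(uint64_t nbogus, uint64_t bogus)
{
    uint64_t total = nbogus + bogus;
    return total ? (double)bogus / (double)total : 0.0;
}
```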
A number of P6 family processor performance monitoring events are modified for the Pentium M processor. Table
19-41 lists the performance monitoring events that were changed in the Pentium M processor, and differ from
performance monitoring events for the P6 family of processors.
Table 19-42. Events That Can Be Counted with the P6 Family Performance Monitoring Counters
Event Mnemonic Event Unit
Unit Num. Name Mask Description Comments
Data Cache Unit (DCU) 43H DATA_MEM_REFS 00H All loads from any memory type. All stores to any memory type. Each part of a split is counted separately. The internal logic counts not only memory loads and stores, but also internal retries. 80-bit floating-point accesses are double counted, since they are decomposed into a 16-bit exponent load and a 64-bit mantissa load. Memory accesses are only counted when they are actually performed (such as a load that gets squashed because a previous cache miss is outstanding to the same address, and which finally gets performed, is only counted once). Does not include I/O accesses, or other nonmemory accesses.
45H DCU_LINES_IN 00H Total lines allocated in DCU.
46H DCU_M_LINES_IN 00H Number of M state lines allocated in DCU.
47H DCU_M_LINES_OUT 00H Number of M state lines evicted from DCU. This includes evictions via snoop HITM, intervention or replacement.
48H DCU_MISS_OUTSTANDING 00H Weighted number of cycles while a DCU miss is outstanding, incremented by the number of outstanding cache misses at any particular time. Cacheable read requests only are considered. Uncacheable requests are excluded. Read-for-ownerships are counted, as well as line fills, invalidates, and stores. Comment: An access that also misses the L2 is short-changed by 2 cycles (i.e., if it counts N cycles, it should be N+2 cycles). Subsequent loads to the same cache line will not result in any additional counts. Count value not precise, but still useful.
The count includes only L2 cacheable instruction fetches; it does not include UC instruction fetches. It does not include ITLB miss accesses.
29H L2_LD MESI 0FH Number of L2 data loads. This event indicates that a normal, unlocked, load memory access was received by the L2. It includes only L2 cacheable memory accesses; it does not include I/O accesses, other nonmemory accesses, or memory accesses such as UC/WT memory accesses. It does include L2 cacheable TLB miss memory accesses.
2AH L2_ST MESI 0FH Number of L2 data stores. This event indicates that a normal, unlocked, store memory access was received by the L2. It indicates that the DCU sent a read-for-ownership request to the L2. It also includes Invalid to Modified requests sent by the DCU to the L2. It includes only L2 cacheable memory accesses; it does not include I/O accesses, other nonmemory accesses, or memory accesses such as UC/WT memory accesses. It includes TLB miss memory accesses.
24H L2_LINES_IN 00H Number of lines allocated in the L2.
26H L2_LINES_OUT 00H Number of lines removed from the L2 for any reason.
25H L2_M_LINES_INM 00H Number of modified lines allocated in the L2.
27H L2_M_LINES_OUTM 00H Number of modified lines removed from the L2 for any reason.
2EH L2_RQSTS MESI 0FH Total number of L2 requests.
21H L2_ADS 00H Number of L2 address strobes.
22H L2_DBUS_BUSY 00H Number of cycles during which the L2 cache data bus was busy.
23H L2_DBUS_BUSY_RD 00H Number of cycles during which the data bus was busy transferring read data from L2 to the processor.
External Bus Logic (EBL)2 62H BUS_DRDY_CLOCKS 00H (Self), 20H (Any) Number of clocks during which DRDY# is asserted. Utilization of the external system data bus during data transfers. Unit Mask = 00H counts bus clocks when the processor is driving DRDY#. Unit Mask = 20H counts in processor clocks when any agent is driving DRDY#.
63H BUS_LOCK_CLOCKS 00H (Self), 20H (Any) Number of clocks during which LOCK# is asserted on the external system bus.3 Always counts in processor clocks.
60H BUS_REQ_OUTSTANDING 00H (Self) Number of bus requests outstanding. This counter is incremented by the number of cacheable read bus requests outstanding in any given cycle. Counts only DCU full-line cacheable reads, not RFOs, writes, instruction fetches, or anything else. Counts “waiting for bus to complete” (last data chunk received).
65H BUS_TRAN_BRD 00H (Self), 20H (Any) Number of burst read transactions.
66H BUS_TRAN_RFO 00H (Self), 20H (Any) Number of completed read for ownership transactions.
6EH BUS_TRAN_BURST 00H (Self), 20H (Any) Number of completed burst transactions.
• If the PC bit is clear, the processor toggles the BPMi pins when the counter overflows.
• If the clock ratio is not 2:1 or 3:1, the BPMi pins will not function for these performance monitoring counter events.
7EH BUS_SNOOP_STALL 00H (Self) Number of clock cycles during which the bus is snoop stalled.
Floating-Point Unit C1H FLOPS 00H Number of computational floating-point operations retired. Excludes floating-point computational operations that cause traps or assists. Includes floating-point computational operations executed by the assist handler. Includes internal sub-operations for complex floating-point instructions like transcendentals. Excludes floating-point loads and stores. Counter 0 only.
10H FP_COMP_OPS_EXE 00H Number of computational floating-point operations executed. The number of FADD, FSUB, FCOM, FMULs, integer MULs and IMULs, FDIVs, FPREMs, FSQRTS, integer DIVs, and IDIVs. This number does not include the number of cycles, but the number of operations. This event does not distinguish an FADD used in the middle of a transcendental flow from a separate FADD instruction. Counter 0 only.
11H FP_ASSIST 00H Number of floating-point exception cases handled by microcode. This event includes counts due to speculative execution. Counter 1 only.
12H MUL 00H Number of multiplies. This count includes integer as well as FP multiplies and is speculative. Counter 1 only.
13H DIV 00H Number of divides. This count includes integer as well as FP divides and is speculative. Counter 1 only.
14H CYCLES_DIV_BUSY 00H Number of cycles during which the divider is busy, and cannot accept new divides. This includes integer and FP divides, FPREM, FPSQRT, etc. and is speculative. Counter 0 only.
Memory Ordering 03H LD_BLOCKS 00H Number of load operations delayed due to store buffer blocks. Includes counts caused by preceding stores whose addresses are unknown, preceding stores whose addresses are known but whose data is unknown, and preceding stores that conflict with the load but only partially overlap it.
04H SB_DRAINS 00H Number of store buffer drain cycles. Incremented every cycle the store buffer is draining. Draining is caused by serializing operations like CPUID, synchronizing operations like XCHG, interrupt acknowledgment, as well as other conditions (such as cache flushing).
05H MISALIGN_MEM_REF 00H Number of misaligned data memory references. Incremented by 1 every cycle during which either the processor’s load or store pipeline dispatches a misaligned μop. Counting is performed if it is the first or second half, or if it is blocked, squashed, or missed. In this context, misaligned means crossing a 64-bit boundary. Comment: MISALIGN_MEM_REF is only an approximation to the true number of misaligned memory references. The value returned is roughly proportional to the number of misaligned memory accesses (the size of the problem).
07H EMON_KNI_PREF_DISPATCHED Number of Streaming SIMD extensions prefetch/weakly-ordered instructions dispatched (speculative prefetches are included in counting). Counters 0 and 1. Pentium III processor only.
00H 0: prefetch NTA
01H 1: prefetch T1
02H 2: prefetch T2
03H 3: weakly ordered stores
4BH EMON_KNI_PREF_MISS Number of prefetch/weakly-ordered instructions that miss all caches. Counters 0 and 1. Pentium III processor only.
00H 0: prefetch NTA
01H 1: prefetch T1
02H 2: prefetch T2
03H 3: weakly ordered stores
Instruction Decoding and Retirement C0H INST_RETIRED 00H Number of instructions retired. A hardware interrupt received during/after the last iteration of the REP STOS flow causes the counter to undercount by 1 instruction. An SMI received while executing a HLT instruction will cause the performance counter to not count the RSM instruction and undercount by 1.
C2H UOPS_RETIRED 00H Number of μops retired.
D0H INST_DECODED 00H Number of instructions decoded.
D8H EMON_KNI_INST_RETIRED Number of Streaming SIMD extensions retired. Counters 0 and 1. Pentium III processor only.
00H 0: packed & scalar
01H 1: scalar
D9H EMON_KNI_COMP_INST_RET Number of Streaming SIMD extensions computation instructions retired. Counters 0 and 1. Pentium III processor only.
00H 0: packed and scalar
01H 1: scalar
Does not include stalls due to bus queue full, too many cache misses, etc. In addition to resource related stalls, this event counts some other events. Includes stalls arising during branch misprediction recovery, such as if retirement of the mispredicted branch is delayed, and stalls arising while the store buffer is draining from synchronizing operations.
D2H PARTIAL_RAT_STALLS 00H Number of cycles or events for partial stalls. This includes flag partial stalls.
Segment Register Loads 06H SEGMENT_REG_LOADS 00H Number of segment register loads.
Clocks 79H CPU_CLK_UNHALTED 00H Number of cycles during which the processor is not halted.
MMX Unit B0H MMX_INSTR_EXEC 00H Number of MMX instructions executed. Available in Intel Celeron, Pentium II and Pentium II Xeon processors only. Does not account for MOVQ and MOVD stores from register to memory.
B1H MMX_SAT_INSTR_EXEC 00H Number of MMX saturating instructions executed. Available in Pentium II and Pentium III processors only.
B2H MMX_UOPS_EXEC 0FH Number of MMX μops executed. Available in Pentium II and Pentium III processors only.
B3H MMX_INSTR_TYPE_EXEC 01H MMX packed multiply instructions executed. Available in Pentium II and Pentium III processors only.
02H MMX packed shift instructions executed.
04H MMX pack operation instructions executed.
08H MMX unpack operation instructions executed.
10H MMX packed logical instructions executed.
20H MMX packed arithmetic instructions executed.
CCH FP_MMX_TRANS 00H Transitions from MMX instructions to floating-point instructions. Available in Pentium II and Pentium III processors only.
01H Transitions from floating-point instructions to MMX instructions.
CDH MMX_ASSIST 00H Number of MMX assists (that is, the number of EMMS instructions executed). Available in Pentium II and Pentium III processors only.
CEH MMX_INSTR_RET 00H Number of MMX instructions retired. Available in Pentium II processors only.
Segment Register Renaming D4H SEG_RENAME_STALLS Number of Segment Register Renaming Stalls: Available in Pentium II and Pentium III processors only.
02H Segment register ES
04H Segment register DS
08H Segment register FS
0FH Segment registers ES + DS + FS + GS
D5H SEG_REG_RENAMES Number of Segment Register Renames: Available in Pentium II and Pentium III processors only.
01H Segment register ES
02H Segment register DS
04H Segment register FS
08H Segment register GS
0FH Segment registers ES + DS + FS + GS
D6H RET_SEG_RENAMES 00H Number of segment register rename events retired. Available in Pentium II and Pentium III processors only.
NOTES:
1. Several L2 cache events, where noted, can be further qualified using the Unit Mask (UMSK) field in the PerfEvtSel0 and
PerfEvtSel1 registers. The lower 4 bits of the Unit Mask field are used in conjunction with L2 events to indicate the cache state or
cache states involved.
The P6 family processors identify cache states using the “MESI” protocol and consequently each bit in the Unit Mask field represents one of the four states: UMSK[3] = M (8H) state, UMSK[2] = E (4H) state, UMSK[1] = S (2H) state, and UMSK[0] = I (1H) state. UMSK[3:0] = “MESI” (FH) should be used to collect data for all states; UMSK = 0H, for the applicable events, will result in nothing being counted.
2. All of the external bus logic (EBL) events, except where noted, can be further qualified using the Unit Mask (UMSK) field in the PerfEvtSel0 and PerfEvtSel1 registers. Bit 5 of the UMSK field is used in conjunction with the EBL events to indicate whether the processor should count transactions that are self-generated (UMSK[5] = 0) or transactions that result from any processor on the bus (UMSK[5] = 1).
3. L2 cache locks, so it is possible to have a zero count.
NOTE
The events in the table that are shaded are implemented only in the Pentium processor with MMX
technology.
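Note 1 above fully determines how to build a MESI-qualified event select. As a sketch for counting L2_LD (29H) across all four states, assuming the standard P6 PerfEvtSel layout (event select in bits 7:0, unit mask in bits 15:8, USR bit 16, OS bit 17, EN bit 22); the helper name is hypothetical:

```c
#include <stdint.h>

/* MESI unit-mask bits per note 1: M=8H, E=4H, S=2H, I=1H;
   FH selects all four states, 0H counts nothing. */
#define MESI_M 0x8u
#define MESI_E 0x4u
#define MESI_S 0x2u
#define MESI_I 0x1u

static uint32_t p6_perfevtsel(uint8_t event, uint8_t umask, int usr, int os)
{
    uint32_t v = event | ((uint32_t)umask << 8);
    if (usr) v |= 1u << 16;
    if (os)  v |= 1u << 17;
    return v | (1u << 22);  /* EN in PerfEvtSel0 enables both counters */
}
```

For example, p6_perfevtsel(0x29, MESI_M | MESI_E | MESI_S | MESI_I, 1, 1) counts L2_LD for all cache states in both user and kernel mode.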
Table 19-43. Events That Can Be Counted with Pentium Processor Performance Monitoring Counters
Event Mnemonic Event
Num. Name Description Comments
00H DATA_READ Number of memory data reads (internal data cache hit and miss combined). Split cycle reads are counted individually. Data memory reads that are part of TLB miss processing are not included. These events may occur at a maximum of two per clock. I/O is not included.
01H DATA_WRITE Number of memory data writes (internal data cache hit and miss combined); I/O not included. Split cycle writes are counted individually. These events may occur at a maximum of two per clock.
02H DATA_TLB_MISS Number of misses to the data cache translation look-aside buffer.
03H DATA_READ_MISS Number of memory read accesses that miss the internal data cache, whether or not the access is cacheable or noncacheable. Additional reads to the same cache line after the first BRDY# of the burst line fill is returned but before the final (fourth) BRDY# has been returned will not cause the counter to be incremented additional times. Data accesses that are part of TLB miss processing are not included. Accesses directed to I/O space are not included.
04H DATA WRITE MISS Number of memory write accesses that miss the internal data cache, whether or not the access is cacheable or noncacheable. Data accesses that are part of TLB miss processing are not included. Accesses directed to I/O space are not included.
05H WRITE_HIT_TO_M_OR_E_STATE_LINES Number of write hits to exclusive or modified lines in the data cache. These are the writes that may be held up if EWBE# is inactive. These events may occur a maximum of two per clock.
06H DATA_CACHE_LINES_WRITTEN_BACK Number of dirty lines (all) that are written back, regardless of the cause. Replacements and internal and external snoops can all cause writeback and are counted.
07H EXTERNAL_SNOOPS Number of accepted external snoops, whether they hit in the code cache or data cache or neither. Assertions of EADS# outside of the sampling interval are not counted, and no internal snoops are counted.
08H EXTERNAL_DATA_CACHE_SNOOP_HITS Number of external snoops to the data cache. Snoop hits to a valid line in either the data cache, the data line fill buffer, or one of the write back buffers are all counted as hits.
09H MEMORY ACCESSES IN BOTH PIPES Number of data memory reads or writes that are paired in both pipes of the pipeline. These accesses are not necessarily run in parallel due to cache misses, bank conflicts, etc.
0AH BANK CONFLICTS Number of actual bank conflicts.
0BH MISALIGNED DATA MEMORY OR I/O REFERENCES Number of memory or I/O reads or writes that are misaligned. A 2- or 4-byte access is misaligned when it crosses a 4-byte boundary; an 8-byte access is misaligned when it crosses an 8-byte boundary. Ten byte accesses are treated as two separate accesses of 8 and 2 bytes each.
0CH CODE READ Number of instruction reads, whether the read is cacheable or noncacheable. Individual 8-byte noncacheable instruction reads are counted.
0DH CODE TLB MISS Number of instruction reads that miss the code TLB, whether the read is cacheable or noncacheable. Individual 8-byte noncacheable instruction reads are counted.
0EH CODE CACHE MISS Number of instruction reads that miss the internal code cache, whether the read is cacheable or noncacheable. Individual 8-byte noncacheable instruction reads are counted.
0FH ANY SEGMENT Number of writes into any segment Segment loads are caused by explicit segment register
REGISTER LOADED register in real or protected mode load instructions, far control transfers, and task switches.
including the LDTR, GDTR, IDTR, and Far control transfers and task switches causing a privilege
TR. level change will signal this event twice. Interrupts and
exceptions may initiate a far control transfer.
10H Reserved
11H Reserved
12H Branches Number of taken and not taken Also counted as taken branches are serializing
branches, including: conditional instructions, VERR and VERW instructions, some segment
branches, jumps, calls, returns, descriptor loads, hardware interrupts (including FLUSH#),
software interrupts, and interrupt and programmatic exceptions that invoke a trap or fault
returns. handler. The pipe is not necessarily flushed.
The number of branches actually executed is measured,
not the number of predicted branches.
13H BTB_HITS Number of BTB hits that occur. Hits are counted only for those instructions that are
actually executed.
14H TAKEN_BRANCH_ Number of taken branches or BTB hits This event type is a logical OR of taken branches and BTB
OR_BTB_HIT that occur. hits. It represents an event that may cause a hit in the
BTB. Specifically, it is either a candidate for a space in the
BTB or it is already in the BTB.
15H PIPELINE FLUSHES Number of pipeline flushes that occur The counter will not be incremented for serializing
Pipeline flushes are caused by BTB instructions (serializing instructions cause the prefetch
misses on taken branches, queue to be flushed but will not trigger the Pipeline
mispredictions, exceptions, interrupts, Flushed event counter) and software interrupts (software
and some segment descriptor loads. interrupts do not flush the pipeline).
16H INSTRUCTIONS_ Number of instructions executed (up Invocations of a fault handler are considered instructions.
EXECUTED to two per clock). All hardware and software interrupts and exceptions will
also cause the count to be incremented. Repeat prefixed
string instructions will only increment this counter once
despite the fact that the repeat loop executes the same
instruction multiple times until the loop criteria is
satisfied.
This applies to all the Repeat string instruction prefixes
(i.e., REP, REPE, REPZ, REPNE, and REPNZ). This counter
will also only increment once per each HLT instruction
executed regardless of how many cycles the processor
remains in the HALT state.
17H INSTRUCTIONS_ Number of instructions executed in This event is the same as the 16H event except it only
EXECUTED_ V PIPE the V_pipe. counts the number of instructions actually executed in
The event indicates the number of the V-pipe.
instructions that were paired.
18H BUS_CYCLE_ Number of clocks while a bus cycle is in The count includes HLDA, AHOLD, and BOFF# clocks.
DURATION progress.
This event measures bus use.
19H WRITE_BUFFER_ Number of clocks while the pipeline is Full write buffers stall data memory read misses, data
FULL_STALL_ stalled due to full write buffers. memory write misses, and data memory write hits to S-
DURATION state lines. Stalls on I/O accesses are not included.
1AH WAITING_FOR_ Number of clocks while the pipeline is Data TLB Miss processing is also included in the count. The
DATA_MEMORY_ stalled while waiting for data memory pipeline stalls while a data memory read is in progress
READ_STALL_ reads. including attempts to read that are not bypassed while a
DURATION line is being filled.
1BH STALL ON WRITE Number of stalls on writes to E- or M-
TO AN E- OR M- state lines.
STATE LINE
1CH LOCKED BUS CYCLE Number of locked bus cycles that occur Only the read portion of the locked read-modify-write is
as the result of the LOCK prefix or counted. Split locked cycles (SCYC active) count as two
LOCK instruction, page-table updates, separate accesses. Cycles restarted due to BOFF# are not
and descriptor table updates. re-counted.
1DH I/O READ OR WRITE Number of bus cycles directed to I/O Misaligned I/O accesses will generate two bus cycles. Bus
CYCLE space. cycles restarted due to BOFF# are not re-counted.
1EH NONCACHEABLE_ Number of noncacheable instruction or Cycles restarted due to BOFF# are not re-counted.
MEMORY_READS data memory read bus cycles.
The count includes read cycles caused
by TLB misses, but does not include
read cycles to I/O space.
1FH PIPELINE_AGI_ Number of address generation An AGI occurs when the instruction in the execute stage
STALLS interlock (AGI) stalls. of either of U- or V-pipelines is writing to either the index
An AGI occurring in both the U- and V- or base address register of an instruction in the D2
pipelines in the same clock signals this (address generation) stage of either the U- or V- pipelines.
event twice.
20H Reserved
21H Reserved
22H FLOPS Number of floating-point operations Number of floating-point adds, subtracts, multiplies,
that occur. divides, remainders, and square roots are counted. The
transcendental instructions consist of multiple adds and
multiplies and will signal this event multiple times.
Instructions generating the divide-by-zero, negative
square root, special operand, or stack exceptions will not
be counted.
Instructions generating all other floating-point exceptions
will be counted. The integer multiply instructions and
other instructions which use the x87 FPU will be counted.
23H BREAKPOINT Number of matches on register DR0 The counter is incremented whether or not the
MATCH ON DR0 breakpoint. breakpoints are enabled. However, if breakpoints are not
REGISTER enabled, code breakpoint matches will not be checked for
instructions executed in the V-pipe and will not cause this
counter to be incremented. (They are checked on
instructions executed in the U-pipe only when breakpoints
are not enabled.)
These events correspond to the signals driven on the
BP[3:0] pins. Refer to Chapter 17, “Debug, Branch Profile,
TSC, and Intel® Resource Director Technology (Intel® RDT)
Features” for more information.
24H BREAKPOINT Number of matches on register DR1 See comment for 23H event.
MATCH ON DR1 breakpoint.
REGISTER
25H BREAKPOINT Number of matches on register DR2 See comment for 23H event.
MATCH ON DR2 breakpoint.
REGISTER
26H BREAKPOINT Number of matches on register DR3 See comment for 23H event.
MATCH ON DR3 breakpoint.
REGISTER
27H HARDWARE Number of taken INTR and NMI
INTERRUPTS interrupts.
28H DATA_READ_OR_ Number of memory data reads and/or Split cycle reads and writes are counted individually. Data
WRITE writes (internal data cache hit and Memory Reads that are part of TLB miss processing are
miss combined). not included. These events may occur at a maximum of
two per clock. I/O is not included.
29H DATA_READ_MISS_ Number of memory read and/or write Additional reads to the same cache line after the first
OR_WRITE_MISS accesses that miss the internal data BRDY# of the burst line fill is returned but before the final
cache, whether or not the access is (fourth) BRDY# has been returned, will not cause the
cacheable or noncacheable. counter to be incremented additional times.
Data accesses that are part of TLB miss processing are
not included. Accesses directed to I/O space are not
included.
2AH BUS_OWNERSHIP_ The time from LRM bus ownership The ratio of the 2AH events counted on counter 0 and
LATENCY request to bus ownership granted counter 1 is the average stall time due to bus ownership
(Counter 0) (that is, the time from the earlier of a conflict.
PBREQ (0), PHITM# or HITM#
assertion to a PBGNT assertion)
2AH BUS OWNERSHIP The number of bus ownership The ratio of the 2AH events counted on counter 0 and
TRANSFERS transfers (that is, the number of counter 1 is the average stall time due to bus ownership
(Counter 1) PBREQ (0) assertions) conflict.
2BH MMX_ Number of MMX instructions executed
INSTRUCTIONS_ in the U-pipe
EXECUTED_
U-PIPE (Counter 0)
2BH MMX_ Number of MMX instructions executed
INSTRUCTIONS_ in the V-pipe
EXECUTED_
V-PIPE (Counter 1)
2CH CACHE_M- Number of times a processor identified If the average memory latencies of the system are known,
STATE_LINE_ a hit to a modified line due to a this event enables the user to count the Write Backs on
SHARING memory access in the other processor PHITM(O) penalty and the Latency on Hit Modified(I)
(Counter 0) (PHITM (O)) penalty.
2CH CACHE_LINE_ Number of shared data lines in the L1
SHARING cache (PHIT (O))
(Counter 1)
2DH EMMS_ Number of EMMS instructions
INSTRUCTIONS_ executed
EXECUTED (Counter
0)
2DH TRANSITIONS_ Number of transitions between MMX This event counts the first floating-point instruction
BETWEEN_MMX_ and floating-point instructions or vice following an MMX instruction or first MMX instruction
AND_FP_ versa following a floating-point instruction.
INSTRUCTIONS An even count indicates the processor The count may be used to estimate the penalty in
(Counter 1) is in MMX state; an odd count indicates transitions between floating-point state and MMX state.
it is in FP state.
2EH BUS_UTILIZATION_ Number of clocks the bus is busy due
DUE_TO_ to the processor’s own activity (the
PROCESSOR_ bus activity that is caused by the
ACTIVITY processor)
(Counter 0)
2EH WRITES_TO_ Number of write accesses to The count includes write cycles caused by TLB misses and
NONCACHEABLE_ noncacheable memory I/O write cycles.
MEMORY Cycles restarted due to BOFF# are not re-counted.
(Counter 1)
2FH SATURATING_ Number of saturating MMX
MMX_ instructions executed, independently
INSTRUCTIONS_ of whether they actually saturated.
EXECUTED (Counter
0)
2FH SATURATIONS_ Number of MMX instructions that used If an MMX instruction operating on 4 doublewords
PERFORMED saturating arithmetic when at least saturated in three out of the four results, the counter will
(Counter 1) one of its results actually saturated be incremented by one only.
30H NUMBER_OF_ Number of cycles the processor is not This event will enable the user to calculate “net CPI”. Note
CYCLES_NOT_IN_ idle due to HLT instruction that during the time that the processor is executing the
HALT_STATE HLT instruction, the Time-Stamp Counter is not disabled.
(Counter 0) Since this event is controlled by the Counter Controls CC0,
CC1 it can be used to calculate the CPI at CPL=3, which
the TSC cannot provide.
30H DATA_CACHE_ Number of clocks the pipeline is stalled
TLB_MISS_ due to a data cache translation look-
STALL_DURATION aside buffer (TLB) miss
(Counter 1)
31H MMX_ Number of MMX instruction data reads
INSTRUCTION_
DATA_READS
(Counter 0)
31H MMX_ Number of MMX instruction data read
INSTRUCTION_ misses
DATA_READ_
MISSES
(Counter 1)
32H FLOATING_POINT_S Number of clocks while pipe is stalled
TALLS_DURATION due to a floating-point freeze
(Counter 0)
32H TAKEN_BRANCHES Number of taken branches
(Counter 1)
33H D1_STARVATION_ Number of times D1 stage cannot The D1 stage can issue 0, 1, or 2 instructions per clock if
AND_FIFO_IS_ issue ANY instructions since the FIFO those are available in an instructions FIFO buffer.
EMPTY buffer is empty
(Counter 0)
33H D1_STARVATION_ Number of times the D1 stage issues a The D1 stage can issue 0, 1, or 2 instructions per clock if
AND_ONLY_ONE_ single instruction (since the FIFO those are available in an instructions FIFO buffer.
INSTRUCTION_IN_ buffer had just one instruction ready) When combined with the previously defined events,
FIFO Instruction Executed (16H) and Instruction Executed in
(Counter 1) the V-pipe (17H), this event enables the user to calculate
the numbers of time pairing rules prevented issuing of
two instructions.
34H MMX_ Number of data writes caused by MMX
INSTRUCTION_ instructions
DATA_WRITES
(Counter 0)
34H MMX_ Number of data write misses caused
INSTRUCTION_ by MMX instructions
DATA_WRITE_
MISSES
(Counter 1)
35H PIPELINE_ Number of pipeline flushes due to The count includes any pipeline flush due to a branch that
FLUSHES_DUE_ wrong branch predictions resolved in the pipeline did not follow correctly. It includes cases
TO_WRONG_ either the E-stage or the WB-stage where a branch was not in the BTB, cases where a branch
BRANCH_ was in the BTB but was mispredicted, and cases where a
PREDICTIONS branch was correctly predicted but to the wrong address.
(Counter 0) Branches are resolved in either the Execute stage
(E-stage) or the Writeback stage (WB-stage). In the latter
case, the misprediction penalty is larger by one clock. The
difference between the 35H event count in counter 0 and
counter 1 is the number of E-stage resolved branches.
35H PIPELINE_ Number of pipeline flushes due to See note for event 35H (Counter 0).
FLUSHES_DUE_ wrong branch predictions resolved in
TO_WRONG_ the WB-stage
BRANCH_
PREDICTIONS_
RESOLVED_IN_
WB-STAGE
(Counter 1)
36H MISALIGNED_ Number of misaligned data memory
DATA_MEMORY_ references when executing MMX
REFERENCE_ON_ instructions
MMX_
INSTRUCTIONS
(Counter 0)
36H PIPELINE_ Number of clocks during pipeline stalls
STALL_FOR_MMX_ caused by waits for MMX instruction
INSTRUCTION_ data memory reads
DATA_MEMORY_
READS
(Counter 1)
37H MISPREDICTED_ Number of returns predicted The count is the difference between the total number of
OR_ incorrectly or not predicted at all executed returns and the number of returns that were
UNPREDICTED_ correctly predicted. Only RET instructions are counted (for
RETURNS example, IRET instructions are not counted).
(Counter 1)
37H PREDICTED_ Number of predicted returns (whether Only RET instructions are counted (for example, IRET
RETURNS they are predicted correctly or instructions are not counted).
(Counter 0) incorrectly).
38H MMX_MULTIPLY_ Number of clocks the pipe is stalled The counter will not be incremented if there is another
UNIT_INTERLOCK since the destination of previous MMX cause for a stall. For each occurrence of a multiply
(Counter 0) multiply instruction is not ready yet interlock, this event will be counted twice (if the stalled
instruction comes on the next clock after the multiply) or
once (if the stalled instruction comes two clocks after
the multiply).
38H MOVD/MOVQ_ Number of clocks a MOVD/MOVQ
STORE_STALL_ instruction store is stalled in D2 stage
DUE_TO_ due to a previous MMX operation with
PREVIOUS_MMX_ a destination to be used in the store
OPERATION instruction.
(Counter 1)
39H RETURNS Number of returns executed. Only RET instructions are counted; IRET instructions are
(Counter 0) not counted. Any exception taken on a RET instruction
and any interrupt recognized by the processor on the
instruction boundary prior to the execution of the RET
instruction will also cause this counter to be incremented.
39H Reserved
3AH BTB_FALSE_ Number of false entries in the Branch False entries are causes for misprediction other than a
ENTRIES Target Buffer wrong prediction.
(Counter 0)
3AH BTB_MISS_ Number of times the BTB predicted a
PREDICTION_ON_ not-taken branch as taken
NOT-TAKEN_
BRANCH
(Counter 1)
3BH FULL_WRITE_ Number of clocks while the pipeline is
BUFFER_STALL_ stalled due to full write buffers while
DURATION_ executing MMX instructions
WHILE_
EXECUTING_MMX_I
NSTRUCTIONS
(Counter 0)
3BH STALL_ON_MMX_ Number of clocks during stalls on MMX
INSTRUCTION_ instructions writing to E- or M-state
WRITE_TO E-_OR_ lines
M-STATE_LINE
(Counter 1)
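The misalignment rule behind event 0BH (a 2- or 4-byte access is misaligned when it crosses a 4-byte boundary; an 8-byte access when it crosses an 8-byte boundary; a 10-byte access is treated as separate 8- and 2-byte accesses) can be expressed as a small predicate. This is an illustrative sketch, not Intel-provided code; the function name is invented:

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if an access of the given size at addr would be counted
 * by event 0BH (MISALIGNED DATA MEMORY OR I/O REFERENCES). */
static bool is_misaligned(uint64_t addr, unsigned size)
{
    switch (size) {
    case 1:
        return false;                              /* a byte never crosses */
    case 2:
    case 4:
        /* misaligned when the access crosses a 4-byte boundary */
        return (addr / 4) != ((addr + size - 1) / 4);
    case 8:
        /* misaligned when the access crosses an 8-byte boundary */
        return (addr / 8) != ((addr + size - 1) / 8);
    case 10:
        /* treated as two separate accesses of 8 and 2 bytes each */
        return is_misaligned(addr, 8) || is_misaligned(addr + 8, 2);
    default:
        return false;
    }
}
```

Note that a 4-byte access at any address with bits 1:0 zero never crosses a 4-byte boundary, so natural alignment is sufficient but not necessary here: a 2-byte access at offset 1 within a 4-byte span is unaligned yet not counted.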
19.Updates to Chapter 24, Volume 3C
Change bars and green text show changes to Chapter 24 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3C: System Programming Guide, Part 3.
------------------------------------------------------------------------------------------
24.1 OVERVIEW
A logical processor uses virtual-machine control data structures (VMCSs) while it is in VMX operation. These
manage transitions into and out of VMX non-root operation (VM entries and VM exits) as well as processor behavior
in VMX non-root operation. A VMCS is manipulated by the instructions VMCLEAR, VMPTRLD, VMREAD, and VMWRITE.
A VMM can use a different VMCS for each virtual machine that it supports. For a virtual machine with multiple
logical processors (virtual processors), the VMM can use a different VMCS for each virtual processor.
A logical processor associates a region in memory with each VMCS. This region is called the VMCS region.1
Software references a specific VMCS using the 64-bit physical address of the region (a VMCS pointer). VMCS pointers
must be aligned on a 4-KByte boundary (bits 11:0 must be zero). These pointers must not set bits beyond the
processor’s physical-address width.2,3
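The alignment and width requirements above can be checked before a pointer is handed to VMPTRLD. The sketch below is illustrative (the function name and the way the width is passed in are assumptions); the width itself comes from CPUID.80000008H:EAX[7:0], as noted in the footnote:

```c
#include <stdbool.h>
#include <stdint.h>

/* A VMCS pointer must be 4-KByte aligned (bits 11:0 zero) and must not
 * set bits at or above the processor's physical-address width. */
static bool vmcs_pointer_valid(uint64_t vmcs_pa, unsigned phys_addr_width)
{
    if (vmcs_pa & 0xFFFULL)                 /* bits 11:0 must be zero */
        return false;
    if (phys_addr_width < 64 &&
        (vmcs_pa >> phys_addr_width) != 0)  /* no bits beyond MAXPHYADDR */
        return false;
    return true;
}
```

A hypervisor would typically run this check when a guest-supplied or internally allocated physical address is about to be loaded, since VMPTRLD itself fails on an invalid pointer.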
A logical processor may maintain a number of VMCSs that are active. The processor may optimize VMX operation
by maintaining the state of an active VMCS in memory, on the processor, or both. At any given time, at most one
of the active VMCSs is the current VMCS. (This document frequently uses the term “the VMCS” to refer to the
current VMCS.) The VMLAUNCH, VMREAD, VMRESUME, and VMWRITE instructions operate only on the current
VMCS.
The following items describe how a logical processor determines which VMCSs are active and which is current:
• The memory operand of the VMPTRLD instruction is the address of a VMCS. After execution of the instruction,
that VMCS is both active and current on the logical processor. Any other VMCS that had been active remains so,
but no other VMCS is current.
• The VMCS link pointer field in the current VMCS (see Section 24.4.2) is itself the address of a VMCS. If VM entry
is performed successfully with the 1-setting of the “VMCS shadowing” VM-execution control, the VMCS
referenced by the VMCS link pointer field becomes active on the logical processor. The identity of the current
VMCS does not change.
• The memory operand of the VMCLEAR instruction is also the address of a VMCS. After execution of the
instruction, that VMCS is neither active nor current on the logical processor. If the VMCS had been current on
the logical processor, the logical processor no longer has a current VMCS.
The VMPTRST instruction stores the address of the logical processor’s current VMCS into a specified memory
location (it stores the value FFFFFFFF_FFFFFFFFH if there is no current VMCS).
The launch state of a VMCS determines which VM-entry instruction should be used with that VMCS: the
VMLAUNCH instruction requires a VMCS whose launch state is “clear”; the VMRESUME instruction requires a VMCS
whose launch state is “launched”. A logical processor maintains a VMCS’s launch state in the corresponding VMCS
region. The following items describe how a logical processor manages the launch state of a VMCS:
• If the launch state of the current VMCS is “clear”, successful execution of the VMLAUNCH instruction changes
the launch state to “launched”.
• The memory operand of the VMCLEAR instruction is the address of a VMCS. After execution of the instruction,
the launch state of that VMCS is “clear”.
• There are no other ways to modify the launch state of a VMCS (it cannot be modified using VMWRITE) and there
is no direct way to discover it (it cannot be read using VMREAD).
1. The amount of memory required for a VMCS region is at most 4 KBytes. The exact size is implementation specific and can be
determined by consulting the VMX capability MSR IA32_VMX_BASIC (see Appendix A.1).
2. Software can determine a processor’s physical-address width by executing CPUID with 80000008H in EAX. The physical-address
width is returned in bits 7:0 of EAX.
3. If IA32_VMX_BASIC[48] is read as 1, these pointers must not set any bits in the range 63:32; see Appendix A.1.
Vol. 3C 24-1
VIRTUAL MACHINE CONTROL STRUCTURES
Figure 24-1 illustrates the different states of a VMCS. It uses “X” to refer to the VMCS and “Y” to refer to any other
VMCS. Thus: “VMPTRLD X” always makes X current and active; “VMPTRLD Y” always makes X not current (because
it makes Y current); VMLAUNCH makes the launch state of X “launched” if X was current and its launch state was
“clear”; and VMCLEAR X always makes X inactive and not current and makes its launch state “clear”.
The figure does not illustrate operations that do not modify the VMCS state relative to these parameters (e.g.,
execution of VMPTRLD X when X is already current). Note that VMCLEAR X makes X “inactive, not current, and
clear,” even if X’s current state is not defined (e.g., even if X has not yet been initialized). See Section 24.11.3.
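The transitions described above can be modeled as a small state machine. This is a simplified sketch for illustration only (real launch state lives in the VMCS region, and "active" is a per-logical-processor property; the names here are invented):

```c
#include <stdbool.h>

/* State of one VMCS X as tracked by the transitions of Figure 24-1. */
struct vmcs_state {
    bool active;
    bool current;
    bool launched;   /* launch state: false = "clear", true = "launched" */
};

/* VMPTRLD X: X becomes both active and current. */
static void vmptrld(struct vmcs_state *x) { x->active = true; x->current = true; }

/* VMPTRLD Y (some other VMCS): X stays active but is no longer current. */
static void vmptrld_other(struct vmcs_state *x) { x->current = false; }

/* VMCLEAR X: X becomes inactive, not current, and its launch state "clear". */
static void vmclear(struct vmcs_state *x)
{
    x->active = false;
    x->current = false;
    x->launched = false;
}

/* VMLAUNCH requires the current VMCS with launch state "clear"; on success
 * the launch state becomes "launched". Returns false where it would fail. */
static bool vmlaunch(struct vmcs_state *x)
{
    if (!x->current || x->launched)
        return false;
    x->launched = true;
    return true;
}
```

A matching `vmresume` would instead require `x->current && x->launched` and leave the launch state unchanged.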
[Figure 24-1. States of VMCS X: the diagram shows the states "anything else," "active, current, clear," and "active, current, launched," with transitions labeled VMPTRLD X, VMPTRLD Y, VMCLEAR X, and VMLAUNCH as described above.]
Because a shadow VMCS (see Section 24.10) cannot be used for VM entry, the launch state of a shadow VMCS is
not meaningful. Figure 24-1 does not illustrate all the ways in which a shadow VMCS may be made active.
The first 4 bytes of the VMCS region contain the VMCS revision identifier at bits 30:0.1 Processors that maintain
VMCS data in different formats (see below) use different VMCS revision identifiers. These identifiers enable
software to avoid using a VMCS region formatted for one processor on a processor that uses a different format.2 Bit 31
of this 4-byte region indicates whether the VMCS is a shadow VMCS (see Section 24.10).
Software should write the VMCS revision identifier to the VMCS region before using that region for a VMCS. The
VMCS revision identifier is never written by the processor; VMPTRLD fails if its operand references a VMCS region
whose VMCS revision identifier differs from that used by the processor. (VMPTRLD also fails if the shadow-VMCS
indicator is 1 and the processor does not support the 1-setting of the “VMCS shadowing” VM-execution control; see
Section 24.6.2) Software can discover the VMCS revision identifier that a processor uses by reading the VMX capa-
bility MSR IA32_VMX_BASIC (see Appendix A.1).
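The layout of this first dword is simple bit manipulation; the helper names below are illustrative, and the revision identifier in any real use must be the value read from IA32_VMX_BASIC, not a constant:

```c
#include <stdint.h>

/* First 4 bytes of a VMCS region: bits 30:0 hold the VMCS revision
 * identifier; bit 31 is the shadow-VMCS indicator. */
#define VMCS_SHADOW_BIT (1U << 31)

static uint32_t make_vmcs_header(uint32_t revision_id, int shadow)
{
    /* the revision identifier occupies only bits 30:0 */
    return (revision_id & 0x7FFFFFFFU) | (shadow ? VMCS_SHADOW_BIT : 0U);
}

static uint32_t vmcs_revision(uint32_t header)  { return header & 0x7FFFFFFFU; }
static int      vmcs_is_shadow(uint32_t header) { return (header >> 31) & 1; }
```

Software would write `make_vmcs_header(...)` to the start of a freshly allocated region before executing VMPTRLD on it; the processor never writes this field itself.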
Software should clear or set the shadow-VMCS indicator depending on whether the VMCS is to be an ordinary
VMCS or a shadow VMCS (see Section 24.10). VMPTRLD fails if the shadow-VMCS indicator is set and the processor
does not support the 1-setting of the “VMCS shadowing” VM-execution control. Software can discover support for
this setting by reading the VMX capability MSR IA32_VMX_PROCBASED_CTLS2 (see Appendix A.3.3).
The next 4 bytes of the VMCS region are used for the VMX-abort indicator. The contents of these bits do not
control processor operation in any way. A logical processor writes a non-zero value into these bits if a VMX abort
occurs (see Section 27.7). Software may also write into this field.
The remainder of the VMCS region is used for VMCS data (those parts of the VMCS that control VMX non-root
operation and the VMX transitions). The format of these data is implementation-specific. VMCS data are discussed
in Section 24.3 through Section 24.9. To ensure proper behavior in VMX operation, software should maintain the
VMCS region and related structures (enumerated in Section 24.11.4) in writeback cacheable memory. Future
implementations may allow or require a different memory type.3 Software should consult the VMX capability MSR
IA32_VMX_BASIC (see Appendix A.1).
1. Earlier versions of this manual specified that the VMCS revision identifier was a 32-bit field. For all processors produced prior to this
change, bit 31 of the VMCS revision identifier was 0.
2. Logical processors that use the same VMCS revision identifier use the same size for VMCS regions.
3. Alternatively, software may map any of these regions or structures with the UC memory type. Doing so is strongly discouraged
unless necessary as it will cause the performance of transitions using those structures to suffer significantly. In addition, the
processor will continue to use the memory type reported in the VMX capability MSR IA32_VMX_BASIC with exceptions noted in
Appendix A.1.
4. Software can discover whether these fields can be written by reading the VMX capability MSR IA32_VMX_MISC (see Appendix A.6).
7 P — Segment present
11:8 Reserved
1. This chapter uses the notation RAX, RIP, RSP, RFLAGS, etc. for processor registers because most processors that support VMX
operation also support Intel 64 architecture. For processors that do not support Intel 64 architecture, this notation refers to the 32-bit
forms of those registers (EAX, EIP, ESP, EFLAGS, etc.). In a few places, notation such as EAX is used to refer specifically to lower 32
bits of the indicated register.
2. There are a few exceptions to this statement. For example, a segment with a non-null selector may be unusable following a task
switch that fails after its commit point; see “Interrupt 10—Invalid TSS Exception (#TS)” in Section 6.14, “Exception and Interrupt
Handling in 64-bit Mode,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A. In contrast, the TR
register is usable after processor reset despite having a null selector; see Table 10-1 in the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 3A.
15 G — Granularity
31:17 Reserved
The base address, segment limit, and access rights compose the “hidden” part (or “descriptor cache”) of each
segment register. These data are included in the VMCS because it is possible for a segment register’s descriptor
cache to be inconsistent with the segment descriptor in memory (in the GDT or the LDT) referenced by the
segment register’s selector.
The value of the DPL field for SS is always equal to the logical processor’s current privilege level (CPL).1
On some processors, executions of VMWRITE ignore attempts to write non-zero values to any of bits 11:8 or
bits 31:17. On such processors, VMREAD always returns 0 for those bits, and VM entry treats those bits as if
they were all 0 (see Section 26.3.1.2).
• The following fields for each of the registers GDTR and IDTR:
— Base address (64 bits; 32 bits on processors that do not support Intel 64 architecture).
— Limit (32 bits). The limit fields contain 32 bits even though these fields are specified as only 16 bits in the
architecture.
• The following MSRs:
— IA32_DEBUGCTL (64 bits)
— IA32_SYSENTER_CS (32 bits)
— IA32_SYSENTER_ESP and IA32_SYSENTER_EIP (64 bits; 32 bits on processors that do not support Intel 64
architecture)
— IA32_PERF_GLOBAL_CTRL (64 bits). This field is supported only on processors that support the 1-setting
of the “load IA32_PERF_GLOBAL_CTRL” VM-entry control.
— IA32_PAT (64 bits). This field is supported only on processors that support either the 1-setting of the “load
IA32_PAT” VM-entry control or that of the “save IA32_PAT” VM-exit control.
— IA32_EFER (64 bits). This field is supported only on processors that support either the 1-setting of the “load
IA32_EFER” VM-entry control or that of the “save IA32_EFER” VM-exit control.
— IA32_BNDCFGS (64 bits). This field is supported only on processors that support either the 1-setting of the
“load IA32_BNDCFGS” VM-entry control or that of the “clear IA32_BNDCFGS” VM-exit control.
— IA32_RTIT_CTL (64 bits). This field is supported only on processors that support either the 1-setting of the
“load IA32_RTIT_CTL” VM-entry control or that of the “clear IA32_RTIT_CTL” VM-exit control.
— IA32_S_CET (64 bits; 32 bits on processors that do not support Intel 64 architecture). This field is
supported only on processors that support the 1-setting of the “load CET state” VM-entry control.
— IA32_INTERRUPT_SSP_TABLE_ADDR (64 bits; 32 bits on processors that do not support Intel 64 architecture).
This field is supported only on processors that support the 1-setting of the “load CET state” VM-entry control.
1. In protected mode, CPL is also associated with the RPL field in the CS selector. However, the RPL fields are not meaningful in real-
address mode or in virtual-8086 mode.
• The shadow-stack pointer register SSP (64 bits; 32 bits on processors that do not support Intel 64 architecture).
This field is supported only on processors that support the 1-setting of the “load CET state” VM-entry control.
• The register SMBASE (32 bits). This register contains the base address of the logical processor’s SMRAM image.
0 Blocking by STI See the “STI—Set Interrupt Flag” section in Chapter 4 of the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 2B.
Execution of STI with RFLAGS.IF = 0 blocks maskable interrupts on the instruction boundary
following its execution.1 Setting this bit indicates that this blocking is in effect.
1 Blocking by See Section 6.8.3, “Masking Exceptions and Interrupts When Switching Stacks,” in the Intel® 64
MOV SS and IA-32 Architectures Software Developer’s Manual, Volume 3A.
Execution of a MOV to SS or a POP to SS blocks or suppresses certain debug exceptions as well
as interrupts (maskable and nonmaskable) on the instruction boundary following its execution.
Setting this bit indicates that this blocking is in effect.2 This document uses the term “blocking
by MOV SS,” but it applies equally to POP SS.
2 Blocking by SMI See Section 34.2, “System Management Interrupt (SMI).” System-management interrupts
(SMIs) are disabled while the processor is in system-management mode (SMM). Setting this bit
indicates that blocking of SMIs is in effect.
1. Execution of the MWAIT instruction may put a logical processor into an inactive state. However, this VMCS field never reflects this
state. See Section 27.1.
2. A triple fault occurs when a logical processor encounters an exception while attempting to deliver a double fault.
3 Blocking by NMI See Section 6.7.1, “Handling Multiple NMIs,” in the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A and Section 34.8, “NMI Handling While in SMM.”
Delivery of a non-maskable interrupt (NMI) or a system-management interrupt (SMI) blocks
subsequent NMIs until the next execution of IRET. See Section 25.3 for how this behavior of
IRET may change in VMX non-root operation. Setting this bit indicates that blocking of NMIs is
in effect. Clearing this bit does not imply that NMIs are not (temporarily) blocked for other
reasons.
If the “virtual NMIs” VM-execution control (see Section 24.6.1) is 1, this bit does not control the
blocking of NMIs. Instead, it refers to “virtual-NMI blocking” (the fact that guest software is not
ready for an NMI).
4 Enclave Set to 1 if the VM exit occurred while the logical processor was in enclave mode.
interruption Such VM exits include those caused by interrupts, non-maskable interrupts, system-
management interrupts, INIT signals, and exceptions occurring in enclave mode as well as
exceptions encountered during the delivery of such events incident to enclave mode.
A VM exit that is incident to delivery of an event injected by VM entry leaves this bit
unmodified.
31:5 Reserved VM entry will fail if these bits are not 0. See Section 26.3.1.5.
NOTES:
1. Nonmaskable interrupts and system-management interrupts may also be inhibited on the instruction boundary following such an
execution of STI.
2. System-management interrupts may also be inhibited on the instruction boundary following such an execution of MOV or POP.
• Pending debug exceptions (64 bits; 32 bits on processors that do not support Intel 64 architecture). IA-32
processors may recognize one or more debug exceptions without immediately delivering them.1 This field
contains information about such exceptions. This field is described in Table 24-4.
3:0 B3 – B0 When set, each of these bits indicates that the corresponding breakpoint condition was met.
Any of these bits may be set even if the corresponding enabling bit in DR7 is not set.
11:4 Reserved VM entry fails if these bits are not 0. See Section 26.3.1.5.
12 Enabled When set, this bit indicates that at least one data or I/O breakpoint was met and was enabled in
breakpoint DR7.
14 BS When set, this bit indicates that a debug exception would have been triggered by single-step
execution mode.
1. For example, execution of a MOV to SS or a POP to SS may inhibit some debug exceptions for one instruction. See Section 6.8.3 of
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A. In addition, certain events incident to an instruction
(for example, an INIT signal) may take priority over debug traps generated by that instruction. See Table 6-2 in the Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 3A.
16 RTM When set, this bit indicates that a debug exception (#DB) or a breakpoint exception (#BP)
occurred inside an RTM region while advanced debugging of RTM transactional regions was
enabled (see Section 16.3.7, “RTM-Enabled Debugger Support,” of Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 1).1
63:17 Reserved VM entry fails if these bits are not 0. See Section 26.3.1.5. Bits 63:32 exist only on processors
that support Intel 64 architecture.
NOTES:
1. In general, the format of this field matches that of DR6. However, DR6 clears bit 16 to indicate an RTM-related exception, while this
field sets the bit to indicate that condition.
• VMCS link pointer (64 bits). If the “VMCS shadowing” VM-execution control is 1, the VMREAD and VMWRITE
instructions access the VMCS referenced by this pointer (see Section 24.10). Otherwise, software should set
this field to FFFFFFFF_FFFFFFFFH to avoid VM-entry failures (see Section 26.3.1.5).
• VMX-preemption timer value (32 bits). This field is supported only on processors that support the 1-setting
of the “activate VMX-preemption timer” VM-execution control. This field contains the value that the VMX-
preemption timer will use following the next VM entry with that setting. See Section 25.5.1 and Section 26.7.4.
• Page-directory-pointer-table entries (PDPTEs; 64 bits each). These four (4) fields (PDPTE0, PDPTE1,
PDPTE2, and PDPTE3) are supported only on processors that support the 1-setting of the “enable EPT” VM-
execution control. They correspond to the PDPTEs referenced by CR3 when PAE paging is in use (see Section
4.4 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). They are used only if
the “enable EPT” VM-execution control is 1.
• Guest interrupt status (16 bits). This field is supported only on processors that support the 1-setting of the
“virtual-interrupt delivery” VM-execution control. It characterizes part of the guest’s virtual-APIC state and
does not correspond to any processor or APIC registers. It comprises two 8-bit subfields:
— Requesting virtual interrupt (RVI). This is the low byte of the guest interrupt status. The processor
treats this value as the vector of the highest priority virtual interrupt that is requesting service. (The value
0 implies that there is no such interrupt.)
— Servicing virtual interrupt (SVI). This is the high byte of the guest interrupt status. The processor treats
this value as the vector of the highest priority virtual interrupt that is in service. (The value 0 implies that
there is no such interrupt.)
See Chapter 29 for more information on the use of this field.
• PML index (16 bits). This field is supported only on processors that support the 1-setting of the “enable PML”
VM-execution control. It contains the logical index of the next entry in the page-modification log. Because the
page-modification log comprises 512 entries, the PML index is typically a value in the range 0–511. Details of
the page-modification log and use of the PML index are given in Section 28.2.6.
• Base-address fields for FS, GS, TR, GDTR, and IDTR (64 bits each; 32 bits on processors that do not support
Intel 64 architecture).
• The following MSRs:
— IA32_SYSENTER_CS (32 bits)
— IA32_SYSENTER_ESP and IA32_SYSENTER_EIP (64 bits; 32 bits on processors that do not support Intel 64
architecture).
— IA32_PERF_GLOBAL_CTRL (64 bits). This field is supported only on processors that support the 1-setting
of the “load IA32_PERF_GLOBAL_CTRL” VM-exit control.
— IA32_PAT (64 bits). This field is supported only on processors that support the 1-setting of the “load
IA32_PAT” VM-exit control.
— IA32_EFER (64 bits). This field is supported only on processors that support the 1-setting of the “load
IA32_EFER” VM-exit control.
— IA32_S_CET (64 bits; 32 bits on processors that do not support Intel 64 architecture). This field is
supported only on processors that support the 1-setting of the “load CET state” VM-exit control.
— IA32_INTERRUPT_SSP_TABLE_ADDR (64 bits; 32 bits on processors that do not support Intel 64 archi-
tecture). This field is supported only on processors that support the 1-setting of the “load CET state” VM-
exit control.
• The shadow-stack pointer register SSP (64 bits; 32 bits on processors that do not support Intel 64 archi-
tecture). This field is supported only on processors that support the 1-setting of the “load CET state” VM-exit
control.
In addition to the state identified here, some processor state components are loaded with fixed values on every
VM exit; there are no fields corresponding to these components in the host-state area. See Section 27.5 for details
of how state is loaded on VM exits.
1. Some asynchronous events cause VM exits regardless of the settings of the pin-based VM-execution controls (see Section 25.2).
All other bits in this field are reserved, some to 0 and some to 1. Software should consult the VMX capability MSRs
IA32_VMX_PINBASED_CTLS and IA32_VMX_TRUE_PINBASED_CTLS (see Appendix A.3.1) to determine how to set
reserved bits. Failure to set reserved bits properly causes subsequent VM entries to fail (see Section 26.2.1.1).
The first processors to support the virtual-machine extensions supported only the 1-settings of bits 1, 2, and 4. The
VMX capability MSR IA32_VMX_PINBASED_CTLS will always report that these bits must be 1. Logical processors
that support the 0-settings of any of these bits will support the VMX capability MSR
IA32_VMX_TRUE_PINBASED_CTLS, and software should consult this MSR to discover support for the 0-
settings of these bits. Software that is not aware of the functionality of any one of these bits should set that bit to 1.
1. Some instructions cause VM exits regardless of the settings of the processor-based VM-execution controls (see Section 25.1.2), as
do task switches (see Section 25.2).
All other bits in this field are reserved, some to 0 and some to 1. Software should consult the VMX capability MSRs
IA32_VMX_PROCBASED_CTLS and IA32_VMX_TRUE_PROCBASED_CTLS (see Appendix A.3.2) to determine how
to set reserved bits. Failure to set reserved bits properly causes subsequent VM entries to fail (see Section
26.2.1.1).
The first processors to support the virtual-machine extensions supported only the 1-settings of bits 1, 4–6, 8, 13–
16, and 26. The VMX capability MSR IA32_VMX_PROCBASED_CTLS will always report that these bits must be 1.
Logical processors that support the 0-settings of any of these bits will support the VMX capability MSR
IA32_VMX_TRUE_PROCBASED_CTLS, and software should consult this MSR to discover support for the 0-
settings of these bits. Software that is not aware of the functionality of any one of these bits should set that bit to 1.
Bit 31 of the primary processor-based VM-execution controls determines whether the secondary processor-based
VM-execution controls are used. If that bit is 0, VM entry and VMX non-root operation function as if all the
secondary processor-based VM-execution controls were 0. Processors that support only the 0-setting of bit 31 of
the primary processor-based VM-execution controls do not support the secondary processor-based VM-execution
controls.
Table 24-7 lists the secondary processor-based VM-execution controls. See Chapter 25 for more details of how
these controls affect processor behavior in VMX non-root operation.
All other bits in this field are reserved to 0. Software should consult the VMX capability MSR
IA32_VMX_PROCBASED_CTLS2 (see Appendix A.3.3) to determine which bits may be set to 1. Failure to clear
reserved bits causes subsequent VM entries to fail (see Section 26.2.1.1).
24.6.6 Guest/Host Masks and Read Shadows for CR0 and CR4
VM-execution control fields include guest/host masks and read shadows for the CR0 and CR4 registers. These
fields control executions of instructions that access those registers (including CLTS, LMSW, MOV CR, and SMSW).
They are 64 bits on processors that support Intel 64 architecture and 32 bits on processors that do not.
In general, bits set to 1 in a guest/host mask correspond to bits “owned” by the host:
• Guest attempts to set them (using CLTS, LMSW, or MOV to CR) to values differing from the corresponding bits
in the corresponding read shadow cause VM exits.
• Guest reads (using MOV from CR or SMSW) return values for these bits from the corresponding read shadow.
Bits cleared to 0 correspond to bits “owned” by the guest; guest attempts to modify them succeed and guest reads
return values for these bits from the control register itself.
See Chapter 27 for details regarding how these fields affect VMX non-root operation.
1. If the local APIC does not support x2APIC mode, it is always in xAPIC mode.
The TPR threshold exists only on processors that support the 1-setting of the “use TPR shadow” VM-execution
control.
• EOI-exit bitmap (4 fields; 64 bits each). These fields are supported only on processors that support the 1-
setting of the “virtual-interrupt delivery” VM-execution control. They are used to determine which virtualized
writes to the APIC’s EOI register cause VM exits:
— EOI_EXIT0 contains bits for vectors from 0 (bit 0) to 63 (bit 63).
— EOI_EXIT1 contains bits for vectors from 64 (bit 0) to 127 (bit 63).
— EOI_EXIT2 contains bits for vectors from 128 (bit 0) to 191 (bit 63).
— EOI_EXIT3 contains bits for vectors from 192 (bit 0) to 255 (bit 63).
See Section 29.1.4 for more information on the use of this field.
• Posted-interrupt notification vector (16 bits). This field is supported only on processors that support the 1-
setting of the “process posted interrupts” VM-execution control. Its low 8 bits contain the interrupt vector that
is used to notify a logical processor that virtual interrupts have been posted. See Section 29.6 for more
information on the use of this field.
• Posted-interrupt descriptor address (64 bits). This field is supported only on processors that support the
1-setting of the “process posted interrupts” VM-execution control. It is the physical address of a 64-byte
aligned posted interrupt descriptor. See Section 29.6 for more information on the use of this field.
Bit Position(s) Field
2:0 EPT paging-structure memory type1
5:3 This value is 1 less than the EPT page-walk length (see Section 28.2.2)
6 Setting this control to 1 enables accessed and dirty flags for EPT (see Section 28.2.5)2
7 Setting this control to 1 enables enforcement of access rights for supervisor shadow-stack pages (see Section
28.2.3.2)3
11:8 Reserved
N–1:12 Bits N–1:12 of the physical address of the 4-KByte aligned EPT PML4 table4
63:N Reserved
NOTES:
1. Software should read the VMX capability MSR IA32_VMX_EPT_VPID_CAP (see Appendix A.10) to determine what EPT paging-struc-
ture memory types are supported.
2. Not all processors support accessed and dirty flags for EPT. Software should read the VMX capability MSR
IA32_VMX_EPT_VPID_CAP (see Appendix A.10) to determine whether the processor supports this feature.
3. Not all processors enforce access rights for shadow-stack pages. Software should read the VMX capability MSR
IA32_VMX_EPT_VPID_CAP (see Appendix A.10) to determine whether the processor supports this feature.
4. N is the physical-address width supported by the logical processor. Software can determine a processor’s physical-address width by
executing CPUID with 80000008H in EAX. The physical-address width is returned in bits 7:0 of EAX.
The EPTP exists only on processors that support the 1-setting of the “enable EPT” VM-execution control.
All other bits in this field are reserved to 0. Software should consult the VMX capability MSR IA32_VMX_VMFUNC
(see Appendix A.11) to determine which bits are reserved. Failure to clear reserved bits causes subsequent
VM entries to fail (see Section 26.2.1.1).
Processors that support the 1-setting of the “EPTP switching” VM-function control also support a 64-bit field called
the EPTP-list address. This field contains the physical address of the 4-KByte EPTP list. The EPTP list comprises
512 8-Byte entries (each an EPTP value) and is used by the EPTP-switching VM function (see Section 25.5.5.3).
Bit Position(s) Field
11:0 Reserved
N–1:12 Bits N–1:12 of the physical address of the 4-KByte aligned root SPP table
63:N1 Reserved
NOTES:
1. N is the processor’s physical-address width. Software can determine this width by executing CPUID with 80000008H in EAX. The
physical-address width is returned in bits 7:0 of EAX.
The SPPTP exists only on processors that support the 1-setting of the “sub-page write permissions for EPT” VM-
execution control.
All other bits in this field are reserved, some to 0 and some to 1. Software should consult the VMX capability MSRs
IA32_VMX_EXIT_CTLS and IA32_VMX_TRUE_EXIT_CTLS (see Appendix A.4) to determine how it should set the
reserved bits. Failure to set reserved bits properly causes subsequent VM entries to fail (see Section 26.2.1.2).
The first processors to support the virtual-machine extensions supported only the 1-settings of bits 0–8, 10, 11,
13, 14, 16, and 17. The VMX capability MSR IA32_VMX_EXIT_CTLS always reports that these bits must be 1.
Logical processors that support the 0-settings of any of these bits will support the VMX capability MSR
IA32_VMX_TRUE_EXIT_CTLS, and software should consult this MSR to discover support for the 0-settings of
these bits. Software that is not aware of the functionality of any one of these bits should set that bit to 1.
• VM-exit MSR-store count (32 bits). This field specifies the number of MSRs to be stored on VM exit. It is
recommended that this count not exceed 512.1 Otherwise, unpredictable processor behavior (including a
machine check) may result during VM exit.
• VM-exit MSR-store address (64 bits). This field contains the physical address of the VM-exit MSR-store area.
The area is a table of entries, 16 bytes per entry, where the number of entries is given by the VM-exit MSR-store
count. The format of each entry is given in Table 24-12. If the VM-exit MSR-store count is not zero, the address
must be 16-byte aligned.
1. Future implementations may allow more MSRs to be stored reliably. Software should consult the VMX capability MSR
IA32_VMX_MISC to determine the number supported (see Appendix A.6).
2. Future implementations may allow more MSRs to be loaded reliably. Software should consult the VMX capability MSR
IA32_VMX_MISC to determine the number supported (see Appendix A.6).
NOTES:
1. Bit 5 of the IA32_VMX_MISC MSR is read as 1 on any logical processor that supports the 1-setting of the “unrestricted guest” VM-
execution control. If it is read as 1, every VM exit stores the value of IA32_EFER.LMA into the “IA-32e mode guest” VM-entry control
(see Section 27.2).
All other bits in this field are reserved, some to 0 and some to 1. Software should consult the VMX capability MSRs
IA32_VMX_ENTRY_CTLS and IA32_VMX_TRUE_ENTRY_CTLS (see Appendix A.5) to determine how it should set
the reserved bits. Failure to set reserved bits properly causes subsequent VM entries to fail (see Section 26.2.1.3).
The first processors to support the virtual-machine extensions supported only the 1-settings of bits 0–8 and 12.
The VMX capability MSR IA32_VMX_ENTRY_CTLS always reports that these bits must be 1. Logical processors that
support the 0-settings of any of these bits will support the VMX capability MSR IA32_VMX_TRUE_ENTRY_CTLS,
and software should consult this MSR to discover support for the 0-settings of these bits. Software that is not
aware of the functionality of any one of these bits should set that bit to 1.
1. Future implementations may allow more MSRs to be loaded reliably. Software should consult the VMX capability MSR
IA32_VMX_MISC to determine the number supported (see Appendix A.6).
• VM-entry interruption-information field (32 bits). This field provides details about the event to be injected.
Table 24-14 describes the field.
— The vector (bits 7:0) determines which entry in the IDT is used or which other event is injected.
— The interruption type (bits 10:8) determines details of how the injection is performed. In general, a VMM
should use the type hardware exception for all exceptions other than the following:
• breakpoint exceptions (#BP; a VMM should use the type software exception);
• overflow exceptions (#OF; a VMM should use the type software exception); and
• those debug exceptions (#DB) that are generated by INT1 (a VMM should use the type privileged
software exception).1
The type other event is used for injection of events that are not delivered through the IDT.2
— For exceptions, the deliver-error-code bit (bit 11) determines whether delivery pushes an error code on
the guest stack.
— VM entry injects an event if and only if the valid bit (bit 31) is 1. The valid bit in this field is cleared on every
VM exit (see Section 27.2).
• VM-entry exception error code (32 bits). This field is used if and only if the valid bit (bit 31) and the deliver-
error-code bit (bit 11) are both set in the VM-entry interruption-information field.
• VM-entry instruction length (32 bits). For injection of events whose type is software interrupt, software
exception, or privileged software exception, this field is used to determine the value of RIP that is pushed on
the stack.
See Section 26.6 for details regarding the mechanics of event injection, including the use of the interruption type
and the VM-entry instruction length.
VM exits clear the valid bit (bit 31) in the VM-entry interruption-information field.
1. The type hardware exception should be used for all other debug exceptions.
2. INT1 and INT3 refer to the instructions with opcodes F1 and CC, respectively, and not to INT n with values 1 or 3 for n.
3. Software can discover whether these fields can be written by reading the VMX capability MSR IA32_VMX_MISC (see Appendix A.6).
16 Always cleared to 0
27 A VM exit saves this bit as 1 to indicate that the VM exit was incident to enclave mode.
30 Reserved (cleared to 0)
— Bits 15:0 provide basic information about the cause of the VM exit (if bit 31 is clear) or of the VM-entry
failure (if bit 31 is set). Appendix C enumerates the basic exit reasons.
— Bit 16 is always cleared to 0.
— Bit 27 is set to 1 if the VM exit occurred while the logical processor was in enclave mode.
A VM exit also sets this bit if it is incident to delivery of an event injected by VM entry and the guest inter-
ruptibility-state field indicates an enclave interrupt (bit 4 of the field is 1). See Section 27.2.1 for details.
— Bit 28 is set only by an SMM VM exit (see Section 34.15.2) that took priority over an MTF VM exit (see
Section 25.5.2) that would have occurred had the SMM VM exit not occurred. See Section 34.15.2.3.
— Bit 29 is set if and only if the processor was in VMX root operation at the time the VM exit occurred. This can
happen only for SMM VM exits. See Section 34.15.2.
— Because some VM-entry failures load processor state from the host-state area (see Section 26.8), software
must be able to distinguish such cases from true VM exits. Bit 31 is used for that purpose.
• Exit qualification (64 bits; 32 bits on processors that do not support Intel 64 architecture). This field contains
additional information about the cause of VM exits due to the following: debug exceptions; page-fault
exceptions; start-up IPIs (SIPIs); task switches; INVEPT; INVLPG; INVVPID; LGDT; LIDT; LLDT; LTR; SGDT;
SIDT; SLDT; STR; VMCLEAR; VMPTRLD; VMPTRST; VMREAD; VMWRITE; VMXON; XRSTORS; XSAVES; control-
register accesses; MOV DR; I/O instructions; and MWAIT. The format of the field depends on the cause of the
VM exit. See Section 27.2.1 for details.
• Guest-linear address (64 bits; 32 bits on processors that do not support Intel 64 architecture). This field is
used in the following cases:
— VM exits due to attempts to execute LMSW with a memory operand.
— VM exits due to attempts to execute INS or OUTS.
— VM exits due to system-management interrupts (SMIs) that arrive immediately after retirement of I/O
instructions.
— Certain VM exits due to EPT violations.
See Section 27.2.1 and Section 34.15.2.3 for details of when and how this field is used.
• Guest-physical address (64 bits). This field is used for VM exits due to EPT violations and EPT misconfigurations.
See Section 27.2.1 for details of when and how this field is used.
• VM-exit interruption error code (32 bits). For VM exits caused by hardware exceptions that would have
delivered an error code on the stack, this field receives that error code.
Section 27.2.2 provides details of how these fields are saved on VM exits.
1. This includes cases in which the event delivery was caused by event injection as part of VM entry; see Section 26.6.1.2.
• IDT-vectoring error code (32 bits). For VM exits that occur during delivery of hardware exceptions that would
have delivered an error code on the stack, this field receives that error code.
Section 27.2.4 provides details of how these fields are saved on VM exits.
1. This field is also used for VM exits that occur during the delivery of a software interrupt or software exception.
2. Whether the processor provides this information on VM exits due to attempts to execute INS or OUTS can be determined by con-
sulting the VMX capability MSR IA32_VMX_BASIC (see Appendix A.1).
• The VMREAD and VMWRITE instructions can be used in VMX non-root operation to access a shadow VMCS but
not an ordinary VMCS. This fact results from the following:
— If the “VMCS shadowing” VM-execution control is 0, execution of the VMREAD and VMWRITE instructions in
VMX non-root operation always causes VM exits (see Section 25.1.3).
— If the “VMCS shadowing” VM-execution control is 1, execution of the VMREAD and VMWRITE instructions in
VMX non-root operation can access the VMCS referenced by the VMCS link pointer (see Section 30.3).
— If the “VMCS shadowing” VM-execution control is 1, VM entry ensures that any VMCS referenced by the
VMCS link pointer is a shadow VMCS (see Section 26.3.1.5).
In VMX root operation, both types of VMCSs can be accessed with the VMREAD and VMWRITE instructions.
Software should not modify the shadow-VMCS indicator in the VMCS region of a VMCS that is active. Doing so may
cause the VMCS to become corrupted (see Section 24.11.1). Before modifying the shadow-VMCS indicator, soft-
ware should execute VMCLEAR for the VMCS to ensure that it is not active.
1. As noted in Section 24.1, execution of the VMPTRLD instruction makes a VMCS active. In addition, VM entry makes active any
shadow VMCS referenced by the VMCS link pointer in the current VMCS. If a shadow VMCS is made active by VM entry, it is neces-
sary to execute VMCLEAR for that VMCS before allowing that VMCS to become active on another logical processor.
This section has identified operations that may cause a VMCS to become corrupted. These operations may cause
the VMCS’s data to become undefined. Behavior may be unpredictable if that VMCS is used subsequently on any
logical processor. The following items detail some hazards of VMCS corruption:
• VM entries may fail for unexplained reasons or may load undesired processor state.
• The processor may not correctly support VMX non-root operation as documented in Chapter 25 and may
generate unexpected VM exits.
• VM exits may load undesired processor state, save incorrect state into the VMCS, or cause the logical processor
to transition to a shutdown state.
0 Access type (0 = full; 1 = high); must be full for 16-bit, 32-bit, and natural-width fields
9:1 Index
11:10 Type:
0: control
1: VM-exit information
2: guest state
3: host state
12 Reserved (must be 0)
14:13 Width:
0: 16-bit
1: 64-bit
2: 32-bit
3: natural-width
The following items detail the meaning of the bits in each encoding:
• Field width. Bits 14:13 encode the width of the field.
— A value of 0 indicates a 16-bit field.
— A value of 1 indicates a 64-bit field.
— A value of 2 indicates a 32-bit field.
— A value of 3 indicates a natural-width field. Such fields have 64 bits on processors that support Intel 64
architecture and 32 bits on processors that do not.
Fields whose encodings use value 1 are specially treated to allow 32-bit software access to all 64 bits of the
field. Such access is allowed by defining, for each such field, an encoding that allows direct access to the high
32 bits of the field. See below.
• Field type. Bits 11:10 encode the type of VMCS field: control, guest-state, host-state, or VM-exit information.
(The last category also includes the VM-instruction error field.)
• Index. Bits 9:1 distinguish components with the same field width and type.
• Access type. Bit 0 must be 0 for all fields except for 64-bit fields (those with field-width 1; see above). A
VMREAD or VMWRITE using an encoding with this bit cleared to 0 accesses the entire field. For a 64-bit field
with field-width 1, a VMREAD or VMWRITE using an encoding with this bit set to 1 accesses only the high 32 bits
of the field.
Appendix B gives the encodings of all fields in the VMCS.
The following describes the operation of VMREAD and VMWRITE based on processor mode, VMCS-field width, and
access type:
• 16-bit fields:
— A VMREAD returns the value of the field in bits 15:0 of the destination operand; other bits of the destination
operand are cleared to 0.
— A VMWRITE writes the value of bits 15:0 of the source operand into the VMCS field; other bits of the source
operand are not used.
• 32-bit fields:
— A VMREAD returns the value of the field in bits 31:0 of the destination operand; in 64-bit mode, bits 63:32
of the destination operand are cleared to 0.
— A VMWRITE writes the value of bits 31:0 of the source operand into the VMCS field; in 64-bit mode,
bits 63:32 of the source operand are not used.
• 64-bit fields and natural-width fields using the full access type outside IA-32e mode.
— A VMREAD returns the value of bits 31:0 of the field in its destination operand; bits 63:32 of the field are
ignored.
— A VMWRITE writes the value of its source operand to bits 31:0 of the field and clears bits 63:32 of the field.
• 64-bit fields and natural-width fields using the full access type in 64-bit mode (only on processors that support
Intel 64 architecture).
— A VMREAD returns the value of the field in bits 63:0 of the destination operand.
— A VMWRITE writes the value of bits 63:0 of the source operand into the VMCS field.
• 64-bit fields using the high access type.
— A VMREAD returns the value of bits 63:32 of the field in bits 31:0 of the destination operand; in 64-bit
mode, bits 63:32 of the destination operand are cleared to 0.
— A VMWRITE writes the value of bits 31:0 of the source operand to bits 63:32 of the field; in 64-bit mode,
bits 63:32 of the source operand are not used.
Software seeking to read a 64-bit field outside IA-32e mode can use VMREAD with the full access type (reading
bits 31:0 of the field) and VMREAD with the high access type (reading bits 63:32 of the field); the order of the two
VMREAD executions is not important. Software seeking to modify a 64-bit field outside IA-32e mode should first
use VMWRITE with the full access type (establishing bits 31:0 of the field while clearing bits 63:32) and then use
VMWRITE with the high access type (establishing bits 63:32 of the field).
In addition to its other functions, the VMCLEAR instruction initializes any implementation-specific information in
the VMCS region referenced by its operand. To avoid the uncertainties of implementation-specific behavior, soft-
ware should execute VMCLEAR on a VMCS region before making the corresponding VMCS active with VMPTRLD for
the first time. (Figure 24-1 illustrates how execution of VMCLEAR puts a VMCS into a well-defined state.)
The following software usage is consistent with these limitations:
• VMCLEAR should be executed for a VMCS before it is used for VM entry for the first time.
• VMLAUNCH should be used for the first VM entry using a VMCS after VMCLEAR has been executed for that
VMCS.
• VMRESUME should be used for any subsequent VM entry using a VMCS (until the next execution of VMCLEAR
for the VMCS).
It is expected that, in general, VMRESUME will have lower latency than VMLAUNCH. Since “migrating” a VMCS from
one logical processor to another requires use of VMCLEAR (see Section 24.11.1), which sets the launch state of the
VMCS to “clear”, such migration requires the next VM entry to be performed using VMLAUNCH. Software devel-
opers can avoid the performance cost of increased VM-entry latency by avoiding unnecessary migration of a VMCS
from one logical processor to another.
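The recommended usage pattern above can be modeled as simple bookkeeping. In this sketch, `vmclear` and `vm_enter` are stand-ins for the VMCLEAR instruction and a successful VM entry (not a real VMM), and the launch-state model is simplified to the two values relevant here:

```c
#include <stdbool.h>

/* Minimal model of the VMCS launch state and the recommended usage pattern:
 * VMCLEAR sets the launch state to "clear"; the first VM entry afterward
 * must use VMLAUNCH, and subsequent entries should use VMRESUME. */
typedef enum { VMCS_CLEAR, VMCS_LAUNCHED } launch_state_t;

typedef struct { launch_state_t state; } vmcs_t;

static void vmclear(vmcs_t *v) { v->state = VMCS_CLEAR; }

/* True if the next VM entry with this VMCS must use VMLAUNCH;
 * otherwise VMRESUME should be used. */
static bool needs_vmlaunch(const vmcs_t *v) { return v->state == VMCS_CLEAR; }

/* A successful VM entry sets the launch state to "launched". */
static void vm_enter(vmcs_t *v) { v->state = VMCS_LAUNCHED; }
```

This also makes the migration cost visible: migrating a VMCS requires `vmclear`, which forces the next entry back onto the slower VMLAUNCH path.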
1. The amount of memory required for the VMXON region is the same as that required for a VMCS region. This size is implementation
specific and can be determined by consulting the VMX capability MSR IA32_VMX_BASIC (see Appendix A.1).
2. Software can determine a processor’s physical-address width by executing CPUID with 80000008H in EAX. The physical-address
width is returned in bits 7:0 of EAX.
3. If IA32_VMX_BASIC[48] is read as 1, the VMXON pointer must not set any bits in the range 63:32; see Appendix A.1.
20. Updates to Chapter 25, Volume 3C
Change bars and green text show changes to Chapter 25 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3C: System Programming Guide, Part 3.
------------------------------------------------------------------------------------------
Changes to this chapter: Minor update to Section 25.2 “Other Causes of VM Exits”.
In a virtualized environment using VMX, the guest software stack typically runs on a logical processor in VMX non-
root operation. This mode of operation is similar to that of ordinary processor operation outside of the virtualized
environment. This chapter describes the differences between VMX non-root operation and ordinary processor
operation with special attention to causes of VM exits (which bring a logical processor from VMX non-root operation
to VMX root operation). The differences between VMX non-root operation and ordinary processor operation are described
in the following sections:
• Section 25.1, “Instructions That Cause VM Exits”
• Section 25.2, “Other Causes of VM Exits”
• Section 25.3, “Changes to Instruction Behavior in VMX Non-Root Operation”
• Section 25.4, “Other Changes in VMX Non-Root Operation”
• Section 25.5, “Features Specific to VMX Non-Root Operation”
• Section 25.6, “Unrestricted Guests”
Chapter 24, “Virtual Machine Control Structures,” describes the data control structures that govern VMX non-root
operation. Chapter 26, “VM Entries,” describes the operation of VM entries by which the processor transitions from
VMX root operation to VMX non-root operation. Chapter 27, “VM Exits,” describes the operation of VM exits by which
the processor transitions from VMX non-root operation to VMX root operation.
Chapter 28, “VMX Support for Address Translation,” describes two features that support address translation in VMX
non-root operation. Chapter 29, “APIC Virtualization and Virtual Interrupts,” describes features that support virtu-
alization of interrupts and the Advanced Programmable Interrupt Controller (APIC) in VMX non-root operation.
1. These include faults generated by attempts to execute, in virtual-8086 mode, privileged instructions that are not recognized in that
mode.
2. MOV DR is an exception to this rule; see Section 25.1.3.
— A general-protection fault due to the relevant segment (ES for INS; DS for OUTS unless overridden by an
instruction prefix) being unusable
— A general-protection fault due to an offset beyond the limit of the relevant segment
— An alignment-check exception
• Fault-like VM exits have priority over exceptions other than those mentioned above. For example, RDMSR of a
non-existent MSR with CPL = 0 generates a VM exit and not a general-protection exception.
When Section 25.1.2 or Section 25.1.3 (below) identify an instruction execution that may lead to a VM exit, it is
assumed that the instruction does not incur a fault that takes priority over a VM exit.
1. An execution of GETSEC in VMX non-root operation causes a VM exit if CR4.SMXE[Bit 14] = 1 regardless of the value of CPL or RAX.
An execution of GETSEC causes an invalid-opcode exception (#UD) if CR4.SMXE[Bit 14] = 0.
2. Under the dual-monitor treatment of SMIs and SMM, executions of VMCALL cause SMM VM exits in VMX root operation outside SMM.
See Section 34.15.2.
3. Many of the items in this section refer to secondary processor-based VM-execution controls. If bit 31 of the primary processor-
based VM-execution controls is 0, VMX non-root operation functions as if these controls were all 0. See Section 24.6.2.
causes a VM exit (the “unconditional I/O exiting” VM-execution control is ignored if the “use I/O bitmaps”
VM-execution control is 1).
See Section 25.1.1 for information regarding the priority of VM exits relative to faults that may be caused by
the INS and OUTS instructions.
• INVLPG. The INVLPG instruction causes a VM exit if the “INVLPG exiting” VM-execution control is 1.
• INVPCID. The INVPCID instruction causes a VM exit if the “INVLPG exiting” and “enable INVPCID”
VM-execution controls are both 1.
• LGDT, LIDT, LLDT, LTR, SGDT, SIDT, SLDT, STR. These instructions cause VM exits if the “descriptor-table
exiting” VM-execution control is 1.
• LMSW. In general, the LMSW instruction causes a VM exit if it would write, for any bit set in the low 4 bits of
the CR0 guest/host mask, a value different from the corresponding bit in the CR0 read shadow. LMSW never
clears bit 0 of CR0 (CR0.PE); thus, LMSW causes a VM exit if either of the following is true:
— The bit in position 0 (corresponding to CR0.PE) is set in both the CR0 guest/host mask and the source
operand, and the bit in position 0 is clear in the CR0 read shadow.
— For any bit position in the range 3:1, the bit in that position is set in the CR0 guest/host mask and the
values of the corresponding bits in the source operand and the CR0 read shadow differ.
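The LMSW exit test above can be written out as a predicate. This is a sketch of the stated rule, not hardware-derived code; only CR0[3:0] participate, and the PE special case (LMSW cannot clear PE) is handled separately from bits 3:1:

```c
#include <stdbool.h>
#include <stdint.h>

/* Does LMSW with source `src` cause a VM exit, given the CR0 guest/host
 * mask and CR0 read shadow?  Only the low 4 bits are relevant. */
bool lmsw_causes_vm_exit(uint32_t src, uint32_t mask, uint32_t shadow)
{
    src &= 0xF; mask &= 0xF; shadow &= 0xF;

    /* Bit 0: attempting to set PE while the read shadow shows it clear. */
    if ((mask & src & 1) && !(shadow & 1))
        return true;

    /* Bits 3:1: any masked bit where source and shadow differ. */
    return ((src ^ shadow) & mask & 0xE) != 0;
}
```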
• MONITOR. The MONITOR instruction causes a VM exit if the “MONITOR exiting” VM-execution control is 1.
• MOV from CR3. The MOV from CR3 instruction causes a VM exit if the “CR3-store exiting” VM-execution
control is 1. The first processors to support the virtual-machine extensions supported only the 1-setting of this
control.
• MOV from CR8. The MOV from CR8 instruction causes a VM exit if the “CR8-store exiting” VM-execution
control is 1.
• MOV to CR0. The MOV to CR0 instruction causes a VM exit unless the value of its source operand matches, for
the position of each bit set in the CR0 guest/host mask, the corresponding bit in the CR0 read shadow. (If every
bit is clear in the CR0 guest/host mask, MOV to CR0 cannot cause a VM exit.)
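The MOV to CR0 condition reduces to a single mask-and-compare: an exit occurs exactly when some bit that is set in the guest/host mask differs between the source operand and the read shadow. A minimal sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* MOV to CR0 causes a VM exit unless the source matches the CR0 read
 * shadow at every bit position set in the CR0 guest/host mask. */
bool mov_to_cr0_causes_vm_exit(uint64_t src, uint64_t mask, uint64_t shadow)
{
    return ((src ^ shadow) & mask) != 0;
}
```

With an all-zero mask the expression is always 0, matching the parenthetical remark that MOV to CR0 then cannot cause a VM exit.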
• MOV to CR3. The MOV to CR3 instruction causes a VM exit unless the “CR3-load exiting” VM-execution control
is 0 or the value of its source operand is equal to one of the CR3-target values specified in the VMCS. Only the
first n CR3-target values are considered, where n is the CR3-target count. If the “CR3-load exiting” VM-
execution control is 1 and the CR3-target count is 0, MOV to CR3 always causes a VM exit.
The first processors to support the virtual-machine extensions supported only the 1-setting of the “CR3-load
exiting” VM-execution control. These processors always consult the CR3-target controls to determine whether
an execution of MOV to CR3 causes a VM exit.
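The CR3-target check above can be sketched as a linear scan of the first n target values (a sketch of the stated rule; real VMCS field access is omitted):

```c
#include <stdbool.h>
#include <stdint.h>

/* MOV to CR3: no exit if "CR3-load exiting" is 0, or if the source equals
 * one of the first `count` CR3-target values.  With the control set to 1
 * and a CR3-target count of 0, the instruction always exits. */
bool mov_to_cr3_causes_vm_exit(uint64_t src, bool cr3_load_exiting,
                               const uint64_t *targets, unsigned count)
{
    if (!cr3_load_exiting)
        return false;
    for (unsigned i = 0; i < count; i++)
        if (targets[i] == src)
            return false;
    return true;
}
```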
• MOV to CR4. The MOV to CR4 instruction causes a VM exit unless the value of its source operand matches, for
the position of each bit set in the CR4 guest/host mask, the corresponding bit in the CR4 read shadow.
• MOV to CR8. The MOV to CR8 instruction causes a VM exit if the “CR8-load exiting” VM-execution control is 1.
• MOV DR. The MOV DR instruction causes a VM exit if the “MOV-DR exiting” VM-execution control is 1. Such
VM exits represent an exception to the principles identified in Section 25.1.1 in that they take priority over the
following: general-protection exceptions based on privilege level; and invalid-opcode exceptions that occur
because CR4.DE=1 and the instruction specified access to DR4 or DR5.
• MWAIT. The MWAIT instruction causes a VM exit if the “MWAIT exiting” VM-execution control is 1. If this
control is 0, the behavior of the MWAIT instruction may be modified (see Section 25.3).
• PAUSE. The behavior of this instruction depends on CPL and the settings of the “PAUSE exiting” and
“PAUSE-loop exiting” VM-execution controls:
— CPL = 0.
• If the “PAUSE exiting” and “PAUSE-loop exiting” VM-execution controls are both 0, the PAUSE
instruction executes normally.
• If the “PAUSE exiting” VM-execution control is 1, the PAUSE instruction causes a VM exit (the “PAUSE-
loop exiting” VM-execution control is ignored if CPL = 0 and the “PAUSE exiting” VM-execution control
is 1).
• If the “PAUSE exiting” VM-execution control is 0 and the “PAUSE-loop exiting” VM-execution control is
1, the following treatment applies.
The processor determines the amount of time between this execution of PAUSE and the previous
execution of PAUSE at CPL 0. If this amount of time exceeds the value of the VM-execution control field
PLE_Gap, the processor considers this execution to be the first execution of PAUSE in a loop. (It also
does so for the first execution of PAUSE at CPL 0 after VM entry.)
Otherwise, the processor determines the amount of time since the most recent execution of PAUSE that
was considered to be the first in a loop. If this amount of time exceeds the value of the VM-execution
control field PLE_Window, a VM exit occurs.
For purposes of these computations, time is measured based on a counter that runs at the same rate as
the timestamp counter (TSC).
— CPL > 0.
• If the “PAUSE exiting” VM-execution control is 0, the PAUSE instruction executes normally.
• If the “PAUSE exiting” VM-execution control is 1, the PAUSE instruction causes a VM exit.
The “PAUSE-loop exiting” VM-execution control is ignored if CPL > 0.
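The PAUSE-loop-exiting decision at CPL 0 can be modeled as a small state machine over TSC-rate timestamps. This is a sketch of the algorithm as described above (with illustrative state names, not architectural ones):

```c
#include <stdbool.h>
#include <stdint.h>

/* State for the PAUSE-loop-exiting computation; all times are in units of
 * a counter running at the TSC rate. */
typedef struct {
    uint64_t last_pause;  /* time of the previous PAUSE at CPL 0        */
    uint64_t loop_start;  /* time of the first PAUSE of the current loop */
} ple_state_t;

/* Returns true if this execution of PAUSE at CPL 0 causes a VM exit. */
bool pause_causes_vm_exit(ple_state_t *s, uint64_t now,
                          uint64_t ple_gap, uint64_t ple_window)
{
    bool exits = false;
    if (now - s->last_pause > ple_gap) {
        s->loop_start = now;           /* gap exceeded: first PAUSE of a new loop */
    } else if (now - s->loop_start > ple_window) {
        exits = true;                  /* the loop has spun longer than PLE_Window */
    }
    s->last_pause = now;
    return exits;
}
```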
• RDMSR. The RDMSR instruction causes a VM exit if any of the following are true:
— The “use MSR bitmaps” VM-execution control is 0.
— The value of ECX is not in the ranges 00000000H – 00001FFFH and C0000000H – C0001FFFH.
— The value of ECX is in the range 00000000H – 00001FFFH and bit n in read bitmap for low MSRs is 1, where
n is the value of ECX.
— The value of ECX is in the range C0000000H – C0001FFFH and bit n in read bitmap for high MSRs is 1,
where n is the value of ECX & 00001FFFH.
See Section 24.6.9 for details regarding how these bitmaps are identified.
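The four RDMSR conditions can be folded into one predicate over the two 1-KByte read bitmaps. This sketch assumes conventional little-endian bit numbering within each bitmap byte (bit n of the bitmap is bit n mod 8 of byte n / 8):

```c
#include <stdbool.h>
#include <stdint.h>

/* Test bit n of a 1-KByte (8192-bit) MSR bitmap. */
static bool bitmap_bit(const uint8_t bitmap[1024], uint32_t n)
{
    return (bitmap[n / 8] >> (n % 8)) & 1;
}

/* Does RDMSR with this ECX cause a VM exit? */
bool rdmsr_causes_vm_exit(uint32_t ecx, bool use_msr_bitmaps,
                          const uint8_t read_low[1024],
                          const uint8_t read_high[1024])
{
    if (!use_msr_bitmaps)
        return true;                           /* control is 0: always exit */
    if (ecx <= 0x1FFFU)
        return bitmap_bit(read_low, ecx);      /* low-MSR read bitmap */
    if (ecx >= 0xC0000000U && ecx <= 0xC0001FFFU)
        return bitmap_bit(read_high, ecx & 0x1FFFU);  /* high-MSR read bitmap */
    return true;                               /* ECX outside both ranges */
}
```

The WRMSR conditions later in this section have the same shape, using the write bitmaps instead.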
• RDPMC. The RDPMC instruction causes a VM exit if the “RDPMC exiting” VM-execution control is 1.
• RDRAND. The RDRAND instruction causes a VM exit if the “RDRAND exiting” VM-execution control is 1.
• RDSEED. The RDSEED instruction causes a VM exit if the “RDSEED exiting” VM-execution control is 1.
• RDTSC. The RDTSC instruction causes a VM exit if the “RDTSC exiting” VM-execution control is 1.
• RDTSCP. The RDTSCP instruction causes a VM exit if the “RDTSC exiting” and “enable RDTSCP” VM-execution
controls are both 1.
• RSM. The RSM instruction causes a VM exit if executed in system-management mode (SMM).1
• TPAUSE. The TPAUSE instruction causes a VM exit if the “RDTSC exiting” and “enable user wait and pause”
VM-execution controls are both 1.
• UMWAIT. The UMWAIT instruction causes a VM exit if the “RDTSC exiting” and “enable user wait and pause”
VM-execution controls are both 1.
• VMREAD. The VMREAD instruction causes a VM exit if any of the following are true:
— The “VMCS shadowing” VM-execution control is 0.
— Bits 63:15 (bits 31:15 outside 64-bit mode) of the register source operand are not all 0.
— Bit n in VMREAD bitmap is 1, where n is the value of bits 14:0 of the register source operand. See Section
24.6.15 for details regarding how the VMREAD bitmap is identified.
If the VMREAD instruction does not cause a VM exit, it reads from the VMCS referenced by the VMCS link
pointer. See Chapter 30, “VMREAD—Read Field from Virtual-Machine Control Structure” for details of the
operation of the VMREAD instruction.
• VMWRITE. The VMWRITE instruction causes a VM exit if any of the following are true:
— The “VMCS shadowing” VM-execution control is 0.
1. Execution of the RSM instruction outside SMM causes an invalid-opcode exception regardless of whether the processor is in VMX
operation. It also does so in VMX root operation in SMM; see Section 34.15.3.
— Bits 63:15 (bits 31:15 outside 64-bit mode) of the register source operand are not all 0.
— Bit n in VMWRITE bitmap is 1, where n is the value of bits 14:0 of the register source operand. See Section
24.6.15 for details regarding how the VMWRITE bitmap is identified.
If the VMWRITE instruction does not cause a VM exit, it writes to the VMCS referenced by the VMCS link
pointer. See Chapter 30, “VMWRITE—Write Field to Virtual-Machine Control Structure” for details of the
operation of the VMWRITE instruction.
• WBINVD. The WBINVD instruction causes a VM exit if the “WBINVD exiting” VM-execution control is 1.
• WRMSR. The WRMSR instruction causes a VM exit if any of the following are true:
— The “use MSR bitmaps” VM-execution control is 0.
— The value of ECX is not in the ranges 00000000H – 00001FFFH and C0000000H – C0001FFFH.
— The value of ECX is in the range 00000000H – 00001FFFH and bit n in write bitmap for low MSRs is 1,
where n is the value of ECX.
— The value of ECX is in the range C0000000H – C0001FFFH and bit n in write bitmap for high MSRs is 1,
where n is the value of ECX & 00001FFFH.
See Section 24.6.9 for details regarding how these bitmaps are identified.
• XRSTORS. The XRSTORS instruction causes a VM exit if the “enable XSAVES/XRSTORS” VM-execution control
is 1 and any bit is set in the logical-AND of the following three values: EDX:EAX, the IA32_XSS MSR, and the
XSS-exiting bitmap (see Section 24.6.20).
• XSAVES. The XSAVES instruction causes a VM exit if the “enable XSAVES/XRSTORS” VM-execution control is
1 and any bit is set in the logical-AND of the following three values: EDX:EAX, the IA32_XSS MSR, and the XSS-
exiting bitmap (see Section 24.6.20).
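The XSAVES/XRSTORS condition is a three-way AND over 64-bit masks; a one-line sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* XSAVES/XRSTORS cause a VM exit when the control is enabled and any bit
 * is set in the logical-AND of EDX:EAX, the IA32_XSS MSR, and the
 * XSS-exiting bitmap. */
bool xsaves_causes_vm_exit(bool enable_xsaves, uint64_t edx_eax,
                           uint64_t ia32_xss, uint64_t xss_exiting_bitmap)
{
    return enable_xsaves &&
           (edx_eax & ia32_xss & xss_exiting_bitmap) != 0;
}
```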
1. INT1 and INT3 refer to the instructions with opcodes F1 and CC, respectively, and not to INT n with value 1 or 3 for n.
• Non-maskable interrupts (NMIs). An NMI causes a VM exit if the “NMI exiting” VM-execution control is 1.
Otherwise, it is delivered using descriptor 2 of the IDT. (If a logical processor is in the wait-for-SIPI state, NMIs
are blocked. The NMI is not delivered through the IDT and no VM exit occurs.)
• INIT signals. INIT signals cause VM exits. A logical processor performs none of the operations normally
associated with these events. Such exits do not modify register state or clear pending events as they would
outside of VMX operation. (If a logical processor is in the wait-for-SIPI state, INIT signals are blocked. They do
not cause VM exits in this case.)
• Start-up IPIs (SIPIs). SIPIs cause VM exits. If a logical processor is not in the wait-for-SIPI activity state
when a SIPI arrives, no VM exit occurs and the SIPI is discarded. VM exits due to SIPIs do not perform any of
the normal operations associated with those events: they do not modify register state as they would outside of
VMX operation. (If a logical processor is not in the wait-for-SIPI state, SIPIs are blocked. They do not cause
VM exits in this case.)
• Task switches. Task switches are not allowed in VMX non-root operation. Any attempt to effect a task switch
in VMX non-root operation causes a VM exit. See Section 25.4.2.
• System-management interrupts (SMIs). If the logical processor is using the dual-monitor treatment of
SMIs and system-management mode (SMM), SMIs cause SMM VM exits. See Section 34.15.2.1
• VMX-preemption timer. A VM exit occurs when the timer counts down to zero. See Section 25.5.1 for details
of operation of the VMX-preemption timer.
Debug-trap exceptions and higher priority events take priority over VM exits caused by the VMX-preemption
timer. VM exits caused by the VMX-preemption timer take priority over VM exits caused by the “NMI-window
exiting” VM-execution control and lower priority events.
These VM exits wake a logical processor from the same inactive states as would a non-maskable interrupt.
Specifically, they wake a logical processor from the shutdown state and from the states entered using the HLT
and MWAIT instructions. These VM exits do not occur if the logical processor is in the wait-for-SIPI state.
In addition, there are controls that cause VM exits based on the readiness of guest software to receive interrupts:
• If the “interrupt-window exiting” VM-execution control is 1, a VM exit occurs before execution of any instruction
if RFLAGS.IF = 1 and there is no blocking of events by STI or by MOV SS (see Table 24-3). Such a VM exit
occurs immediately after VM entry if the above conditions are true (see Section 26.7.5).
Non-maskable interrupts (NMIs) and higher priority events take priority over VM exits caused by this control.
VM exits caused by this control take priority over external interrupts and lower priority events.
These VM exits wake a logical processor from the same inactive states as would an external interrupt. Specifi-
cally, they wake a logical processor from the states entered using the HLT and MWAIT instructions. These
VM exits do not occur if the logical processor is in the shutdown state or the wait-for-SIPI state.
• If the “NMI-window exiting” VM-execution control is 1, a VM exit occurs before execution of any instruction if
there is no virtual-NMI blocking and there is no blocking of events by MOV SS and no blocking of events by STI
(see Table 24-3). Such a VM exit occurs immediately after VM entry if the above conditions are true (see
Section 26.7.6).
VM exits caused by the VMX-preemption timer and higher priority events take priority over VM exits caused by
this control. VM exits caused by this control take priority over non-maskable interrupts (NMIs) and lower
priority events.
These VM exits wake a logical processor from the same inactive states as would an NMI. Specifically, they wake
a logical processor from the shutdown state and from the states entered using the HLT and MWAIT instructions.
These VM exits do not occur if the logical processor is in the wait-for-SIPI state.
1. Under the dual-monitor treatment of SMIs and SMM, SMIs also cause SMM VM exits if they occur in VMX root operation outside SMM.
If the processor is using the default treatment of SMIs and SMM, SMIs are delivered as described in Section 34.14.1.
• CLTS. Behavior of the CLTS instruction is determined by the bits in position 3 (corresponding to CR0.TS) in the
CR0 guest/host mask and the CR0 read shadow:
— If bit 3 in the CR0 guest/host mask is 0, CLTS clears CR0.TS normally (the value of bit 3 in the CR0 read
shadow is irrelevant in this case), unless CR0.TS is fixed to 1 in VMX operation (see Section 23.8), in which
case CLTS causes a general-protection exception.
— If bit 3 in the CR0 guest/host mask is 1 and bit 3 in the CR0 read shadow is 0, CLTS completes but does not
change the contents of CR0.TS.
— If the bits in position 3 in the CR0 guest/host mask and the CR0 read shadow are both 1, CLTS causes a
VM exit.
• INVPCID. Behavior of the INVPCID instruction is determined first by the setting of the “enable INVPCID”
VM-execution control:
— If the “enable INVPCID” VM-execution control is 0, INVPCID causes an invalid-opcode exception (#UD).
This exception takes priority over any other exception the instruction may incur.
— If the “enable INVPCID” VM-execution control is 1, treatment is based on the setting of the “INVLPG
exiting” VM-execution control:
• If the “INVLPG exiting” VM-execution control is 0, INVPCID operates normally.
• If the “INVLPG exiting” VM-execution control is 1, INVPCID causes a VM exit.
• IRET. Behavior of IRET with regard to NMI blocking (see Table 24-3) is determined by the settings of the “NMI
exiting” and “virtual NMIs” VM-execution controls:
— If the “NMI exiting” VM-execution control is 0, IRET operates normally and unblocks NMIs. (If the “NMI
exiting” VM-execution control is 0, the “virtual NMIs” control must be 0; see Section 26.2.1.1.)
— If the “NMI exiting” VM-execution control is 1, IRET does not affect blocking of NMIs. If, in addition, the
“virtual NMIs” VM-execution control is 1, the logical processor tracks virtual-NMI blocking. In this case,
IRET removes any virtual-NMI blocking.
The unblocking of NMIs or virtual NMIs specified above occurs even if IRET causes a fault.
• LMSW. Outside of VMX non-root operation, LMSW loads its source operand into CR0[3:0], but it does not clear
CR0.PE if that bit is set. In VMX non-root operation, an execution of LMSW that does not cause a VM exit (see
Section 25.1.3) leaves unmodified any bit in CR0[3:0] corresponding to a bit set in the CR0 guest/host mask.
An attempt to set any other bit in CR0[3:0] to a value not supported in VMX operation (see Section 23.8)
causes a general-protection exception. Attempts to clear CR0.PE are ignored without fault.
• MOV from CR0. The behavior of MOV from CR0 is determined by the CR0 guest/host mask and the CR0 read
shadow. For each position corresponding to a bit clear in the CR0 guest/host mask, the destination operand is
loaded with the value of the corresponding bit in CR0. For each position corresponding to a bit set in the CR0
guest/host mask, the destination operand is loaded with the value of the corresponding bit in the CR0 read
shadow. Thus, if every bit is cleared in the CR0 guest/host mask, MOV from CR0 reads normally from CR0; if
every bit is set in the CR0 guest/host mask, MOV from CR0 returns the value of the CR0 read shadow.
Depending on the contents of the CR0 guest/host mask and the CR0 read shadow, bits may be set in the
destination that would never be set when reading directly from CR0.
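The bit-selection rule above (and the identical rules for MOV from CR4 and SMSW later in this section) is a classic mask-merge; a minimal sketch:

```c
#include <stdint.h>

/* Value returned by MOV from CR0: unmasked bit positions come from CR0
 * itself, masked positions from the CR0 read shadow. */
uint64_t mov_from_cr0_value(uint64_t cr0, uint64_t mask, uint64_t shadow)
{
    return (cr0 & ~mask) | (shadow & mask);
}
```

With mask = 0 this reads CR0 directly; with all bits set it returns the read shadow, matching the two extremes described in the text.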
• MOV from CR3. If the “enable EPT” VM-execution control is 1 and an execution of MOV from CR3 does not
cause a VM exit (see Section 25.1.3), the value loaded from CR3 is a guest-physical address; see Section
28.2.1.
• MOV from CR4. The behavior of MOV from CR4 is determined by the CR4 guest/host mask and the CR4 read
shadow. For each position corresponding to a bit clear in the CR4 guest/host mask, the destination operand is
loaded with the value of the corresponding bit in CR4. For each position corresponding to a bit set in the CR4
guest/host mask, the destination operand is loaded with the value of the corresponding bit in the CR4 read
shadow. Thus, if every bit is cleared in the CR4 guest/host mask, MOV from CR4 reads normally from CR4; if
every bit is set in the CR4 guest/host mask, MOV from CR4 returns the value of the CR4 read shadow.
2. Some of the items in this section refer to secondary processor-based VM-execution controls. If bit 31 of the primary processor-
based VM-execution controls is 0, VMX non-root operation functions as if these controls were all 0. See Section 24.6.2.
Depending on the contents of the CR4 guest/host mask and the CR4 read shadow, bits may be set in the
destination that would never be set when reading directly from CR4.
• MOV from CR8. If the MOV from CR8 instruction does not cause a VM exit (see Section 25.1.3), its behavior is
modified if the “use TPR shadow” VM-execution control is 1; see Section 29.3.
• MOV to CR0. An execution of MOV to CR0 that does not cause a VM exit (see Section 25.1.3) leaves
unmodified any bit in CR0 corresponding to a bit set in the CR0 guest/host mask. Treatment of attempts to
modify other bits in CR0 depends on the setting of the “unrestricted guest” VM-execution control:
— If the control is 0, MOV to CR0 causes a general-protection exception if it attempts to set any bit in CR0 to
a value not supported in VMX operation (see Section 23.8).
— If the control is 1, MOV to CR0 causes a general-protection exception if it attempts to set any bit in CR0
other than bit 0 (PE) or bit 31 (PG) to a value not supported in VMX operation. It remains the case,
however, that MOV to CR0 causes a general-protection exception if it would result in CR0.PE = 0 and
CR0.PG = 1 or if it would result in CR0.PG = 1, CR4.PAE = 0, and IA32_EFER.LME = 1.
• MOV to CR3. If the “enable EPT” VM-execution control is 1 and an execution of MOV to CR3 does not cause a
VM exit (see Section 25.1.3), the value loaded into CR3 is treated as a guest-physical address; see Section
28.2.1.
— If PAE paging is not being used, the instruction does not use the guest-physical address to access memory
and it does not cause it to be translated through EPT.1
— If PAE paging is being used, the instruction translates the guest-physical address through EPT and uses the
result to load the four (4) page-directory-pointer-table entries (PDPTEs). The instruction does not use the
guest-physical addresses in the PDPTEs to access memory and it does not cause them to be translated
through EPT.
• MOV to CR4. An execution of MOV to CR4 that does not cause a VM exit (see Section 25.1.3) leaves
unmodified any bit in CR4 corresponding to a bit set in the CR4 guest/host mask. Such an execution causes a
general-protection exception if it attempts to set any bit in CR4 (not corresponding to a bit set in the CR4
guest/host mask) to a value not supported in VMX operation (see Section 23.8).
• MOV to CR8. If the MOV to CR8 instruction does not cause a VM exit (see Section 25.1.3), its behavior is
modified if the “use TPR shadow” VM-execution control is 1; see Section 29.3.
• MWAIT. Behavior of the MWAIT instruction (which always causes an invalid-opcode exception—#UD—if
CPL > 0) is determined by the setting of the “MWAIT exiting” VM-execution control:
— If the “MWAIT exiting” VM-execution control is 1, MWAIT causes a VM exit.
— If the “MWAIT exiting” VM-execution control is 0, MWAIT operates normally if any of the following is true:
(1) ECX[0] is 0; (2) RFLAGS.IF = 1; or (3) both of the following are true: (a) the “interrupt-window exiting” VM-
execution control is 0; and (b) the logical processor has not recognized a pending virtual interrupt (see
Section 29.2.1).
— If the “MWAIT exiting” VM-execution control is 0, ECX[0] = 1, and RFLAGS.IF = 0, MWAIT does not cause
the processor to enter an implementation-dependent optimized state if either the “interrupt-window
exiting” VM-execution control is 1 or the logical processor has recognized a pending virtual interrupt;
instead, control passes to the instruction following the MWAIT instruction.
• RDMSR. Section 25.1.3 identifies when executions of the RDMSR instruction cause VM exits. If such an
execution causes neither a fault due to CPL > 0 nor a VM exit, the instruction’s behavior may be modified for
certain values of ECX:
— If ECX contains 10H (indicating the IA32_TIME_STAMP_COUNTER MSR), the value returned by the
instruction is determined by the setting of the “use TSC offsetting” VM-execution control:
• If the control is 0, RDMSR operates normally, loading EAX:EDX with the value of the
IA32_TIME_STAMP_COUNTER MSR.
• If the control is 1, the value returned is determined by the setting of the “use TSC scaling” VM-execution
control:
1. A logical processor uses PAE paging if CR0.PG = 1, CR4.PAE = 1 and IA32_EFER.LMA = 0. See Section 4.4 in the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 3A.
— If the control is 0, RDMSR loads EAX:EDX with the sum of the value of the
IA32_TIME_STAMP_COUNTER MSR and the value of the TSC offset.
— If the control is 1, RDMSR first computes the product of the value of the
IA32_TIME_STAMP_COUNTER MSR and the value of the TSC multiplier. It then shifts the value of
the product right 48 bits and loads EAX:EDX with the sum of that shifted value and the value of the
TSC offset.
The 1-setting of the “use TSC offsetting” VM-execution control does not affect executions of RDMSR if ECX
contains 6E0H (indicating the IA32_TSC_DEADLINE MSR). Such executions return the APIC-timer deadline
relative to the actual timestamp counter without regard to the TSC offset.
— If ECX is in the range 800H–8FFH (indicating an APIC MSR), instruction behavior may be modified if the
“virtualize x2APIC mode” VM-execution control is 1; see Section 29.5.
• RDPID. Behavior of the RDPID instruction is determined first by the setting of the “enable RDTSCP”
VM-execution control:
— If the “enable RDTSCP” VM-execution control is 0, RDPID causes an invalid-opcode exception (#UD).
— If the “enable RDTSCP” VM-execution control is 1, RDPID operates normally.
• RDTSC. Behavior of the RDTSC instruction is determined by the settings of the “RDTSC exiting” and “use TSC
offsetting” VM-execution controls:
— If both controls are 0, RDTSC operates normally.
— If the “RDTSC exiting” VM-execution control is 0 and the “use TSC offsetting” VM-execution control is 1, the
value returned is determined by the setting of the “use TSC scaling” VM-execution control:
• If the control is 0, RDTSC loads EAX:EDX with the sum of the value of the IA32_TIME_STAMP_COUNTER
MSR and the value of the TSC offset.
• If the control is 1, RDTSC first computes the product of the value of the IA32_TIME_STAMP_COUNTER
MSR and the value of the TSC multiplier. It then shifts the value of the product right 48 bits and loads
EAX:EDX with the sum of that shifted value and the value of the TSC offset.
— If the “RDTSC exiting” VM-execution control is 1, RDTSC causes a VM exit.
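The offsetting and scaling arithmetic above can be sketched directly. The shift-right-by-48 of a 128-bit product is modeled here with `unsigned __int128`, a GCC/Clang extension; the same computation applies to RDTSCP and to RDMSR of the TSC:

```c
#include <stdbool.h>
#include <stdint.h>

/* Value RDTSC returns in VMX non-root operation when "RDTSC exiting" is 0.
 * With scaling enabled, the 128-bit product of the TSC and the TSC
 * multiplier is shifted right 48 bits before the TSC offset is added. */
uint64_t rdtsc_guest_value(uint64_t tsc, bool use_offsetting, bool use_scaling,
                           uint64_t tsc_offset, uint64_t tsc_multiplier)
{
    if (!use_offsetting)
        return tsc;                       /* both controls 0: normal RDTSC */
    if (!use_scaling)
        return tsc + tsc_offset;          /* offsetting only */
    unsigned __int128 product = (unsigned __int128)tsc * tsc_multiplier;
    return (uint64_t)(product >> 48) + tsc_offset;
}
```

Note that a multiplier of 2^48 makes scaling the identity, so scaling then degenerates to plain offsetting.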
• RDTSCP. Behavior of the RDTSCP instruction is determined first by the setting of the “enable RDTSCP”
VM-execution control:
— If the “enable RDTSCP” VM-execution control is 0, RDTSCP causes an invalid-opcode exception (#UD). This
exception takes priority over any other exception the instruction may incur.
— If the “enable RDTSCP” VM-execution control is 1, treatment is based on the settings of the “RDTSC exiting”
and “use TSC offsetting” VM-execution controls:
• If both controls are 0, RDTSCP operates normally.
• If the “RDTSC exiting” VM-execution control is 0 and the “use TSC offsetting” VM-execution control is 1,
the value returned is determined by the setting of the “use TSC scaling” VM-execution control:
— If the control is 0, RDTSCP loads EAX:EDX with the sum of the value of the
IA32_TIME_STAMP_COUNTER MSR and the value of the TSC offset.
— If the control is 1, RDTSCP first computes the product of the value of the
IA32_TIME_STAMP_COUNTER MSR and the value of the TSC multiplier. It then shifts the value of
the product right 48 bits and loads EAX:EDX with the sum of that shifted value and the value of the
TSC offset.
In either case, RDTSCP also loads ECX with the value of bits 31:0 of the IA32_TSC_AUX MSR.
• If the “RDTSC exiting” VM-execution control is 1, RDTSCP causes a VM exit.
• SMSW. The behavior of SMSW is determined by the CR0 guest/host mask and the CR0 read shadow. For each
position corresponding to a bit clear in the CR0 guest/host mask, the destination operand is loaded with the
value of the corresponding bit in CR0. For each position corresponding to a bit set in the CR0 guest/host mask,
the destination operand is loaded with the value of the corresponding bit in the CR0 read shadow. Thus, if every
bit is cleared in the CR0 guest/host mask, SMSW reads normally from CR0; if every bit is set in the CR0
guest/host mask, SMSW returns the value of the CR0 read shadow.
Vol. 3C 25-9
VMX NON-ROOT OPERATION
Note the following: (1) for any memory destination or for a 16-bit register destination, only the low 16 bits of
the CR0 guest/host mask and the CR0 read shadow are used (bits 63:16 of a register destination are left
unchanged); (2) for a 32-bit register destination, only the low 32 bits of the CR0 guest/host mask and the CR0
read shadow are used (bits 63:32 of the destination are cleared); and (3) depending on the contents of the
CR0 guest/host mask and the CR0 read shadow, bits may be set in the destination that would never be set
when reading directly from CR0.
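The SMSW masking described above can be modeled as follows; this is an illustrative sketch, with hypothetical function names:

```python
MASK64 = (1 << 64) - 1

def smsw_value(cr0, cr0_read_shadow, cr0_guest_host_mask):
    """Bits clear in the CR0 guest/host mask come from CR0; bits set
    in it come from the CR0 read shadow (illustrative model)."""
    return ((cr0 & ~cr0_guest_host_mask) |
            (cr0_read_shadow & cr0_guest_host_mask)) & MASK64

def smsw_dest(old_dest, value, operand_bits):
    """Destination-size handling: 16-bit destinations keep bits 63:16
    unchanged; 32-bit register destinations clear bits 63:32."""
    if operand_bits == 16:
        return (old_dest & ~0xFFFF & MASK64) | (value & 0xFFFF)
    if operand_bits == 32:
        return value & 0xFFFFFFFF
    return value & MASK64
```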
• TPAUSE. Behavior of the TPAUSE instruction is determined first by the setting of the “enable user wait and
pause” VM-execution control:
— If the “enable user wait and pause” VM-execution control is 0, TPAUSE causes an invalid-opcode exception
(#UD). This exception takes priority over any exception the instruction may incur.
— If the “enable user wait and pause” VM-execution control is 1, treatment is based on the setting of the
“RDTSC exiting” VM-execution control:
• If the “RDTSC exiting” VM-execution control is 0, the instruction delays for an amount of time called
here the physical delay. The physical delay is first computed by determining the virtual delay (the
time to delay relative to the guest’s timestamp counter).
If IA32_UMWAIT_CONTROL[31:2] is zero, the virtual delay is the value in EDX:EAX minus the value
that RDTSC would return (see above); if IA32_UMWAIT_CONTROL[31:2] is not zero, the virtual delay
is the minimum of that difference and AND(IA32_UMWAIT_CONTROL,FFFFFFFCH).
The physical delay depends upon the settings of the “use TSC offsetting” and “use TSC scaling”
VM-execution controls:
— If either control is 0, the physical delay is the virtual delay.
— If both controls are 1, the virtual delay is multiplied by 2^48 (using a shift) to produce a 128-bit
integer. That product is then divided by the TSC multiplier to produce a 64-bit integer. The physical
delay is that quotient.
• If the “RDTSC exiting” VM-execution control is 1, TPAUSE causes a VM exit.
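The virtual-delay and physical-delay computations above (which TPAUSE and UMWAIT share) can be sketched as an illustrative model; the names are hypothetical and big-integer arithmetic stands in for the hardware's wider intermediate:

```python
MASK64 = (1 << 64) - 1

def virtual_delay(edx_eax, virtual_tsc_now, umwait_control):
    """Time to delay relative to the guest's timestamp counter
    (illustrative). umwait_control models IA32_UMWAIT_CONTROL."""
    diff = (edx_eax - virtual_tsc_now) & MASK64
    limit = umwait_control & 0xFFFFFFFC   # AND(IA32_UMWAIT_CONTROL, FFFFFFFCH)
    if limit == 0:                        # IA32_UMWAIT_CONTROL[31:2] is zero
        return diff
    return min(diff, limit)

def physical_delay(vdelay, tsc_multiplier, use_offsetting, use_scaling):
    """Physical delay derived from the virtual delay (illustrative)."""
    if not (use_offsetting and use_scaling):
        return vdelay
    # Multiply by 2^48 (a shift) to a wider integer, then divide
    # by the TSC multiplier.
    return (vdelay << 48) // tsc_multiplier
```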
• UMONITOR. Behavior of the UMONITOR instruction is determined by the setting of the “enable user wait and
pause” VM-execution control:
— If the “enable user wait and pause” VM-execution control is 0, UMONITOR causes an invalid-opcode
exception (#UD). This exception takes priority over any exception the instruction may incur.
— If the “enable user wait and pause” VM-execution control is 1, UMONITOR operates normally.
• UMWAIT. Behavior of the UMWAIT instruction is determined first by the setting of the “enable user wait and
pause” VM-execution control:
— If the “enable user wait and pause” VM-execution control is 0, UMWAIT causes an invalid-opcode exception
(#UD). This exception takes priority over any exception the instruction may incur.
— If the “enable user wait and pause” VM-execution control is 1, treatment is based on the setting of the
“RDTSC exiting” VM-execution control:
• If the “RDTSC exiting” VM-execution control is 0, and if the instruction causes a delay, the amount of
time delayed is called here the physical delay. The physical delay is first computed by determining the
virtual delay (the time to delay relative to the guest’s timestamp counter).
If IA32_UMWAIT_CONTROL[31:2] is zero, the virtual delay is the value in EDX:EAX minus the value
that RDTSC would return (see above); if IA32_UMWAIT_CONTROL[31:2] is not zero, the virtual delay
is the minimum of that difference and AND(IA32_UMWAIT_CONTROL,FFFFFFFCH).
The physical delay depends upon the settings of the “use TSC offsetting” and “use TSC scaling”
VM-execution controls:
— If either control is 0, the physical delay is the virtual delay.
— If both controls are 1, the virtual delay is multiplied by 2^48 (using a shift) to produce a 128-bit
integer. That product is then divided by the TSC multiplier to produce a 64-bit integer. The physical
delay is that quotient.
• If the “RDTSC exiting” VM-execution control is 1, UMWAIT causes a VM exit.
• WRMSR. Section 25.1.3 identifies when executions of the WRMSR instruction cause VM exits. If such an
execution causes neither a fault due to CPL > 0 nor a VM exit, the instruction’s behavior may be modified for
certain values of ECX:
— If ECX contains 79H (indicating IA32_BIOS_UPDT_TRIG MSR), no microcode update is loaded, and control
passes to the next instruction. This implies that microcode updates cannot be loaded in VMX non-root
operation.
— On processors that support Intel PT but which do not allow it to be used in VMX operation, if ECX contains
570H (indicating the IA32_RTIT_CTL MSR), the instruction causes a general-protection exception.1
— If ECX contains 808H (indicating the TPR MSR), 80BH (the EOI MSR), or 83FH (the self-IPI MSR), instruction
behavior may be modified if the “virtualize x2APIC mode” VM-execution control is 1; see Section 29.5.
• XRSTORS. Behavior of the XRSTORS instruction is determined first by the setting of the “enable
XSAVES/XRSTORS” VM-execution control:
— If the “enable XSAVES/XRSTORS” VM-execution control is 0, XRSTORS causes an invalid-opcode exception
(#UD).
— If the “enable XSAVES/XRSTORS” VM-execution control is 1, treatment is based on the value of the XSS-
exiting bitmap (see Section 24.6.20):
• XRSTORS causes a VM exit if any bit is set in the logical-AND of the following three values: EDX:EAX,
the IA32_XSS MSR, and the XSS-exiting bitmap.
• Otherwise, XRSTORS operates normally.
• XSAVES. Behavior of the XSAVES instruction is determined first by the setting of the “enable
XSAVES/XRSTORS” VM-execution control:
— If the “enable XSAVES/XRSTORS” VM-execution control is 0, XSAVES causes an invalid-opcode exception
(#UD).
— If the “enable XSAVES/XRSTORS” VM-execution control is 1, treatment is based on the value of the XSS-
exiting bitmap (see Section 24.6.20):
• XSAVES causes a VM exit if any bit is set in the logical-AND of the following three values: EDX:EAX, the
IA32_XSS MSR, and the XSS-exiting bitmap.
• Otherwise, XSAVES operates normally.
1. Software should read the VMX capability MSR IA32_VMX_MISC to determine whether the processor allows Intel PT to be used in
VMX operation (see Appendix A.6).
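The VM-exit condition shared by XSAVES and XRSTORS above reduces to a single predicate; a minimal illustrative sketch:

```python
def xsaves_xrstors_cause_vm_exit(edx_eax, ia32_xss, xss_exiting_bitmap):
    """With the "enable XSAVES/XRSTORS" control set, XSAVES and XRSTORS
    cause a VM exit iff any bit is set in the logical-AND of EDX:EAX,
    the IA32_XSS MSR, and the XSS-exiting bitmap (illustrative)."""
    return (edx_eax & ia32_xss & xss_exiting_bitmap) != 0
```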
— If the instruction causes a fault, an MTF VM exit is pending on the instruction boundary following delivery of
the fault (or any nested exception).1
— If the instruction does not cause a fault, an MTF VM exit is pending on the instruction boundary following
execution of that instruction. If the instruction is INT1, INT3, or INTO, this boundary follows delivery of any
software exception. If the instruction is INT n, this boundary follows delivery of a software interrupt. If the
instruction is HLT, the MTF VM exit will be from the HLT activity state.
No MTF VM exit occurs if another VM exit occurs before reaching the instruction boundary on which an MTF VM exit
would be pending (e.g., due to an exception or triple fault).
An MTF VM exit occurs on the instruction boundary on which it is pending unless a higher priority event takes
precedence or the MTF VM exit is blocked due to the activity state:
• System-management interrupts (SMIs), INIT signals, and higher priority events take priority over MTF
VM exits. MTF VM exits take priority over debug-trap exceptions and lower priority events.
• No MTF VM exit occurs if the processor is in either the shutdown activity state or wait-for-SIPI activity state. If
a non-maskable interrupt subsequently takes the logical processor out of the shutdown activity state without
causing a VM exit, an MTF VM exit is pending after delivery of that interrupt.
Special treatment may apply to Intel SGX instructions or if the logical processor is in enclave mode. See Section
42.2 for details.
1. This item includes the cases of an invalid opcode exception—#UD— generated by the UD0, UD1, and UD2 instructions and a BOUND-
range exceeded exception—#BR—generated by the BOUND instruction.
To prevent the loss of buffered trace data, the processor uses a mechanism called trace-address pre-translation
(TAPT). With TAPT, the processor uses EPT to translate the guest-physical address of the current output region
before that address is used to write buffered trace data to memory.
Because of TAPT, no translation (and thus no EPT violation) occurs at the time output is written to memory; the
writes to memory use translations that were cached as part of TAPT. (The details given in Section 25.5.3.1 apply to
TAPT.) TAPT ensures that, if a write to the output region would cause an EPT violation, the resulting VM exit is deliv-
ered at the time of TAPT, before the region would be used. This allows software to resolve the EPT violation at that
time and ensures that, when it is necessary to write buffered trace data to memory, that data will not be lost due
to an EPT violation.
TAPT (and resulting VM exits) may occur at any of the following times:
• When software in VMX non-root operation enables tracing by loading the IA32_RTIT_CTL MSR to set the
TraceEn bit, using the WRMSR instruction or the XRSTORS instruction.
Any VM exit resulting from TAPT in this case is trap-like: the WRMSR or XRSTORS completes before the
VM exit occurs (for example, the value of CS:RIP saved in the guest-state area of the VMCS references the
next instruction).
• At an instruction boundary when one output region becomes full and Intel PT transitions to the next output
region.
VM exits resulting from TAPT in this case take priority over any pending debug exceptions. Such a VM exit will
save information about such exceptions in the guest-state area of the VMCS.
• As part of a VM entry that enables Intel PT. See Section 26.5 for details.
TAPT may translate not only the guest-physical address of the current output region but those of subsequent
output regions as well. (Doing so may provide better protection of trace data.) This implies that any VM exits
resulting from TAPT may result from the translation of output-region addresses other than that of the current
output region.
25.5.5 VM Functions
A VM function is an operation provided by the processor that can be invoked from VMX non-root operation
without a VM exit. VM functions are enabled and configured by the settings of different fields in the VMCS. Soft-
ware in VMX non-root operation invokes a VM function with the VMFUNC instruction; the value of EAX selects the
specific VM function being invoked.
Section 25.5.5.1 explains how VM functions are enabled. Section 25.5.5.2 specifies the behavior of the VMFUNC
instruction. Section 25.5.5.3 describes a specific VM function called EPTP switching.
IF ECX ≥ 512
THEN VM exit;
ELSE
    tent_EPTP ← 8 bytes from EPTP-list address + 8 * ECX;
    IF tent_EPTP is not a valid EPTP value (would cause VM entry to fail if in EPTP)
    THEN VM exit;
    ELSE
        write tent_EPTP to the EPTP field in the current VMCS;
        use tent_EPTP as the new EPTP value for address translation;
        IF processor supports the 1-setting of the “EPT-violation #VE” VM-execution control
        THEN
            write ECX[15:0] to EPTP-index field in current VMCS;
            use ECX[15:0] as EPTP index for subsequent EPT-violation virtualization exceptions (see Section 25.5.6.2);
        FI;
    FI;
FI;
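The pseudocode above can be rendered as an executable model; `eptp_list` stands in for the 512-entry EPTP list in memory and `vmcs` for the current VMCS, and all names here are illustrative:

```python
def eptp_switching_vmfunc(ecx, eptp_list, vmcs, is_valid_eptp,
                          supports_ept_violation_ve):
    """Executable model of the EPTP-switching VM function."""
    if ecx >= 512:
        return "VM exit"
    tent_eptp = eptp_list[ecx]            # 8 bytes from EPTP-list address + 8 * ECX
    if not is_valid_eptp(tent_eptp):
        return "VM exit"
    vmcs["EPTP"] = tent_eptp              # new EPTP used for address translation
    if supports_ept_violation_ve:
        vmcs["EPTP index"] = ecx & 0xFFFF # used by subsequent #VE delivery
    return "continue"
```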
Execution of the EPTP-switching VM function does not modify the state of any registers; no flags are modified.
If the “Intel PT uses guest physical addresses” VM-execution control is 1 and IA32_RTIT_CTL.TraceEn = 1, any
execution of the EPTP-switching VM function causes a VM exit.2
1. “Enable VM functions” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-execution
controls is 0, VMX non-root operation functions as if the “enable VM functions” VM-execution control were 0. See Section 24.6.2.
2. Such a VM exit ensures the proper recording of trace data that might otherwise be lost during the change of EPT paging-structure
hierarchy. Software handling the VM exit can emulate the VM function and then resume the guest.
As noted in Section 25.5.5.2, an execution of the EPTP-switching VM function that causes a VM exit (as specified
above) uses the basic exit reason 59, indicating “VMFUNC”. The length of the VMFUNC instruction is saved into the
VM-exit instruction-length field. No additional VM-exit information is provided.
An execution of VMFUNC that loads EPTP from the EPTP list (and thus does not cause a fault or VM exit) is called
an EPTP-switching VMFUNC. After an EPTP-switching VMFUNC, control passes to the next instruction. The logical
processor starts creating and using guest-physical and combined mappings associated with the new value of bits
51:12 of EPTP; the combined mappings created and used are associated with the current VPID and PCID (these are
not changed by VMFUNC).1 If the “enable VPID” VM-execution control is 0, an EPTP-switching VMFUNC invalidates
combined mappings associated with VPID 0000H (for all PCIDs and for all EP4TA values, where EP4TA is the value
of bits 51:12 of EPTP).
Because an EPTP-switching VMFUNC may change the translation of guest-physical addresses, it may affect use of
the guest-physical address in CR3. The EPTP-switching VMFUNC cannot itself cause a VM exit due to an EPT viola-
tion or an EPT misconfiguration due to the translation of that guest-physical address through the new EPT paging
structures. The following items provide details that apply if CR0.PG = 1:
• If 32-bit paging or 4-level paging2 is in use (either CR4.PAE = 0 or IA32_EFER.LMA = 1), the next memory
access with a linear address uses the translation of the guest-physical address in CR3 through the new EPT
paging structures. As a result, this access may cause a VM exit due to an EPT violation or an EPT misconfigu-
ration encountered during that translation.
• If PAE paging is in use (CR4.PAE = 1 and IA32_EFER.LMA = 0), an EPTP-switching VMFUNC does not load the
four page-directory-pointer-table entries (PDPTEs) from the guest-physical address in CR3. The logical
processor continues to use the four guest-physical addresses already present in the PDPTEs. The guest-
physical address in CR3 is not translated through the new EPT paging structures (until some operation that
would load the PDPTEs).
The EPTP-switching VMFUNC cannot itself cause a VM exit due to an EPT violation or an EPT misconfiguration
encountered during the translation of a guest-physical address in any of the PDPTEs. A subsequent memory
access with a linear address uses the translation of the guest-physical address in the appropriate PDPTE
through the new EPT paging structures. As a result, such an access may cause a VM exit due to an EPT
violation or an EPT misconfiguration encountered during that translation.
If an EPTP-switching VMFUNC establishes an EPTP value that enables accessed and dirty flags for EPT (by setting
bit 6), subsequent memory accesses may fail to set those flags as specified if there has been no appropriate execu-
tion of INVEPT since the last use of an EPTP value that does not enable accessed and dirty flags for EPT (because
bit 6 is clear) and that is identical to the new value on bits 51:12.
If the processor supports the 1-setting of the “EPT-violation #VE” VM-execution control, an EPTP-switching
VMFUNC loads the value in ECX[15:0] into the EPTP-index field in the current VMCS. Subsequent EPT-violation
virtualization exceptions will save this value into the virtualization-exception information area (see Section 25.5.6.2).
1. If the “enable VPID” VM-execution control is 0, the current VPID is 0000H; if CR4.PCIDE = 0, the current PCID is 000H.
2. Earlier versions of this manual used the term “IA-32e paging” to identify 4-level paging.
Byte Offset   Contents
0             The 32-bit value that would have been saved into the VMCS as an exit reason had a VM exit occurred
              instead of the virtualization exception. For EPT violations, this value is 48 (00000030H).
4             FFFFFFFFH
8             The 64-bit value that would have been saved into the VMCS as an exit qualification had a VM exit
              occurred instead of the virtualization exception.
16            The 64-bit value that would have been saved into the VMCS as a guest-linear address had a VM exit
              occurred instead of the virtualization exception.
24            The 64-bit value that would have been saved into the VMCS as a guest-physical address had a VM
              exit occurred instead of the virtualization exception.
32            The current 16-bit value of the EPTP-index VM-execution control (see Section 24.6.19 and Section
              25.5.5.3).
1. If the “mode-based execute control for EPT” VM-execution control is 1, an EPT paging-structure entry is present if any of bits 2:0 or
bit 10 is 1.
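The fields tabulated above, at byte offsets 0, 4, 8, 16, 24, and 32 of the virtualization-exception information area, can be packed as a little-endian byte layout; a minimal sketch with a hypothetical helper name:

```python
import struct

def ve_info_area(exit_reason, exit_qual, guest_linear, guest_phys, eptp_index):
    """Pack the virtualization-exception information fields at their
    documented byte offsets (little-endian, no padding between fields).
    Offset 4 holds FFFFFFFFH, as delivered by the processor."""
    return struct.pack("<IIQQQH", exit_reason, 0xFFFFFFFF, exit_qual,
                       guest_linear, guest_phys, eptp_index)
```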
1. “Unrestricted guest” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-execution
controls is 0, VMX non-root operation functions as if the “unrestricted guest” VM-execution control were 0. See Section 24.6.2.
and exceptions that cause VM exits in protected mode or when paging is enabled also do so in real-address
mode or when paging is disabled. The following examples should be noted:
— If CR0.PG = 0, page faults do not occur and thus cannot cause VM exits.
— If CR0.PE = 0, invalid-TSS exceptions do not occur and thus cannot cause VM exits.
— If CR0.PE = 0, the following instructions cause invalid-opcode exceptions and do not cause VM exits:
INVEPT, INVVPID, LLDT, LTR, SLDT, STR, VMCLEAR, VMLAUNCH, VMPTRLD, VMPTRST, VMREAD,
VMRESUME, VMWRITE, VMXOFF, and VMXON.
• If CR0.PG = 0, each linear address is passed directly to the EPT mechanism for translation to a physical
address.1 The guest memory type passed on to the EPT mechanism is WB (writeback).
1. As noted in Section 26.2.1.1, the “enable EPT” VM-execution control must be 1 if the “unrestricted guest” VM-execution control is 1.
21. Updates to Chapter 26, Volume 3C
Change bars and green text show changes to Chapter 26 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3C: System Programming Guide, Part 3.
------------------------------------------------------------------------------------------
Changes to this chapter: Minor updates to Section 26.2.1.1 “VM-Execution Control Fields”, Section 26.3.1.5
“Checks on Guest Non-Register State”, and Section 26.8 “VM-Entry Failures During or After Loading Guest
State”. Added Control-flow Enforcement Technology details.
Software can enter VMX non-root operation using either of the VM-entry instructions VMLAUNCH and VMRESUME.
VMLAUNCH can be used only with a VMCS whose launch state is clear, and VMRESUME can be used only with a
VMCS whose launch state is launched. VMLAUNCH should be used for the first VM entry after VMCLEAR;
VMRESUME should be used for subsequent VM entries with the same VMCS.
Each VM entry performs the following steps in the order indicated:
1. Basic checks are performed to ensure that VM entry can commence
(Section 26.1).
2. The control and host-state areas of the VMCS are checked to ensure that they are proper for supporting VMX
non-root operation and that the VMCS is correctly configured to support the next VM exit (Section 26.2).
3. The following may be performed in parallel or in any order (Section 26.3):
• The guest-state area of the VMCS is checked to ensure that, after the VM entry completes, the state of the
logical processor is consistent with IA-32 and Intel 64 architectures.
• Processor state is loaded from the guest-state area and based on controls in the VMCS.
• Address-range monitoring is cleared.
4. MSRs are loaded from the VM-entry MSR-load area (Section 26.4).
5. If VMLAUNCH is being executed, the launch state of the VMCS is set to “launched.”
6. If the “Intel PT uses guest physical addresses” VM-execution control is 1, trace-address pre-translation (TAPT)
may occur (see Section 25.5.4 and Section 26.5).
7. An event may be injected in the guest context (Section 26.6).
Steps 1–4 above perform checks that may cause VM entry to fail. Such failures occur in one of the following three
ways:
• Some of the checks in Section 26.1 may generate ordinary faults (for example, an invalid-opcode exception).
Such faults are delivered normally.
• Some of the checks in Section 26.1 and all the checks in Section 26.2 cause control to pass to the instruction
following the VM-entry instruction. The failure is indicated by setting RFLAGS.ZF1 (if there is a current VMCS)
or RFLAGS.CF (if there is no current VMCS). If there is a current VMCS, an error number indicating the cause of
the failure is stored in the VM-instruction error field. See Chapter 30 for the error numbers.
• The checks in Section 26.3 and Section 26.4 cause processor state to be loaded from the host-state area of the
VMCS (as would be done on a VM exit). Information about the failure is stored in the VM-exit information fields.
See Section 26.8 for details.
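The second failure mode above, control passing to the next instruction with RFLAGS indicating the failure, can be sketched as a small illustrative model (the names and dict representation are hypothetical):

```python
def vmfail(current_vmcs, error_number=0):
    """Model of how early VM-entry check failures are signaled:
    CF is set if there is no current VMCS (VMfailInvalid); ZF is set
    and the error number recorded in the VM-instruction error field
    if there is one (VMfailValid). Illustrative sketch only."""
    rflags = {"CF": 0, "ZF": 0}
    if current_vmcs is None:
        rflags["CF"] = 1                  # VMfailInvalid
    else:
        rflags["ZF"] = 1                  # VMfailValid
        current_vmcs["VM-instruction error"] = error_number
    return rflags
```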
EFLAGS.TF = 1 causes a VM-entry instruction to generate a single-step debug exception only if failure of one of the
checks in Section 26.1 and Section 26.2 causes control to pass to the following instruction. A VM-entry does not
generate a single-step debug exception in any of the following cases: (1) the instruction generates a fault; (2)
failure of one of the checks in Section 26.3 or in loading MSRs causes processor state to be loaded from the host-
state area of the VMCS; or (3) the instruction passes all checks in Section 26.1, Section 26.2, and Section 26.3 and
there is no failure in loading MSRs.
Section 34.15 describes the dual-monitor treatment of system-management interrupts (SMIs) and system-
management mode (SMM). Under this treatment, code running in SMM returns using VM entries instead of the RSM
instruction. A VM entry returns from SMM if it is executed in SMM and the “entry to SMM” VM-entry control is 0.
VM entries that return from SMM differ from ordinary VM entries in ways that are detailed in Section 34.15.4.
1. This chapter uses the notation RAX, RIP, RSP, RFLAGS, etc. for processor registers because most processors that support VMX oper-
ation also support Intel 64 architecture. For IA-32 processors, this notation refers to the 32-bit forms of those registers (EAX, EIP,
ESP, EFLAGS, etc.). In a few places, notation such as EAX is used to refer specifically to lower 32 bits of the indicated register.
Vol. 3C 26-1
VM ENTRIES
1. If the “activate secondary controls” primary processor-based VM-execution control is 0, VM entry operates as if each secondary pro-
cessor-based VM-execution control were 0.
• Reserved bits in the primary processor-based VM-execution controls must be set properly. Software may
consult the VMX capability MSRs to determine the proper settings (see Appendix A.3.2).
• If the “activate secondary controls” primary processor-based VM-execution control is 1, reserved bits in the
secondary processor-based VM-execution controls must be cleared. Software may consult the VMX capability
MSRs to determine which bits are reserved (see Appendix A.3.3).
If the “activate secondary controls” primary processor-based VM-execution control is 0 (or if the processor
does not support the 1-setting of that control), no checks are performed on the secondary processor-based
VM-execution controls. The logical processor operates as if all the secondary processor-based VM-execution
controls were 0.
• The CR3-target count must not be greater than 4. Future processors may support a different number of CR3-
target values. Software should read the VMX capability MSR IA32_VMX_MISC to determine the number of
values supported (see Appendix A.6).
• If the “use I/O bitmaps” VM-execution control is 1, bits 11:0 of each I/O-bitmap address must be 0. Neither
address should set any bits beyond the processor’s physical-address width.1,2
• If the “use MSR bitmaps” VM-execution control is 1, bits 11:0 of the MSR-bitmap address must be 0. The
address should not set any bits beyond the processor’s physical-address width.3
• If the “use TPR shadow” VM-execution control is 1, the virtual-APIC address must satisfy the following checks:
— Bits 11:0 of the address must be 0.
— The address should not set any bits beyond the processor’s physical-address width.4
If all of the above checks are satisfied and the “use TPR shadow” VM-execution control is 1, bytes 3:1 of VTPR
(see Section 29.1.1) may be cleared (behavior may be implementation-specific).
The clearing of these bytes may occur even if the VM entry fails. This is true either if the failure causes control
to pass to the instruction following the VM-entry instruction or if it causes processor state to be loaded from
the host-state area of the VMCS.
• If the “use TPR shadow” VM-execution control is 1 and the “virtual-interrupt delivery” VM-execution control is
0, bits 31:4 of the TPR threshold VM-execution control field must be 0.5
• The following check is performed if the “use TPR shadow” VM-execution control is 1 and the “virtualize APIC
accesses” and “virtual-interrupt delivery” VM-execution controls are both 0: the value of bits 3:0 of the TPR
threshold VM-execution control field should not be greater than the value of bits 7:4 of VTPR (see Section
29.1.1).
• If the “NMI exiting” VM-execution control is 0, the “virtual NMIs” VM-execution control must be 0.
• If the “virtual NMIs” VM-execution control is 0, the “NMI-window exiting” VM-execution control must be 0.
• If the “virtualize APIC-accesses” VM-execution control is 1, the APIC-access address must satisfy the following
checks:
— Bits 11:0 of the address must be 0.
— The address should not set any bits beyond the processor’s physical-address width.6
• If the “use TPR shadow” VM-execution control is 0, the following VM-execution controls must also be 0:
“virtualize x2APIC mode”, “APIC-register virtualization”, and “virtual-interrupt delivery”.7
1. Software can determine a processor’s physical-address width by executing CPUID with 80000008H in EAX. The physical-address
width is returned in bits 7:0 of EAX.
2. If IA32_VMX_BASIC[48] is read as 1, these addresses must not set any bits in the range 63:32; see Appendix A.1.
3. If IA32_VMX_BASIC[48] is read as 1, this address must not set any bits in the range 63:32; see Appendix A.1.
4. If IA32_VMX_BASIC[48] is read as 1, this address must not set any bits in the range 63:32; see Appendix A.1.
5. “Virtual-interrupt delivery” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-exe-
cution controls is 0, VM entry functions as if the “virtual-interrupt delivery” VM-execution control were 0. See Section 24.6.2.
6. If IA32_VMX_BASIC[48] is read as 1, this address must not set any bits in the range 63:32; see Appendix A.1.
7. “Virtualize x2APIC mode” and “APIC-register virtualization” are secondary processor-based VM-execution controls. If bit 31 of the
primary processor-based VM-execution controls is 0, VM entry functions as if these controls were 0. See Section 24.6.2.
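Several of the address checks above share one pattern: 4-KByte alignment plus no bits beyond the processor's physical-address width (with bits 63:32 additionally required clear when IA32_VMX_BASIC[48] is read as 1). A minimal illustrative sketch of that common predicate, with hypothetical names:

```python
def valid_vmcs_address(addr, maxphyaddr, require_bits_63_32_clear=False):
    """Common check applied to I/O-bitmap, MSR-bitmap, virtual-APIC,
    and similar VMCS-referenced addresses (illustrative sketch)."""
    if addr & 0xFFF:                      # bits 11:0 must be 0
        return False
    if addr >> maxphyaddr:                # no bits beyond the address width
        return False
    if require_bits_63_32_clear and addr >> 32:
        return False
    return True
```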
• If the “virtualize x2APIC mode” VM-execution control is 1, the “virtualize APIC accesses” VM-execution control
must be 0.
• If the “virtual-interrupt delivery” VM-execution control is 1, the “external-interrupt exiting” VM-execution
control must be 1.
• If the “process posted interrupts” VM-execution control is 1, the following must be true:1
— The “virtual-interrupt delivery” VM-execution control is 1.
— The “acknowledge interrupt on exit” VM-exit control is 1.
— The posted-interrupt notification vector has a value in the range 0–255 (bits 15:8 are all 0).
— Bits 5:0 of the posted-interrupt descriptor address are all 0.
— The posted-interrupt descriptor address does not set any bits beyond the processor's physical-address
width.2
• If the “enable VPID” VM-execution control is 1, the value of the VPID VM-execution control field must not be
0000H.3
• If the “enable EPT” VM-execution control is 1, the EPTP VM-execution control field (see Table 24-8 in Section
24.6.11) must satisfy the following checks:4
— The EPT memory type (bits 2:0) must be a value supported by the processor as indicated in the
IA32_VMX_EPT_VPID_CAP MSR (see Appendix A.10).
— Bits 5:3 (1 less than the EPT page-walk length) must be 3, indicating an EPT page-walk length of 4; see
Section 28.2.2.
— Bit 6 (enable bit for accessed and dirty flags for EPT) must be 0 if bit 21 of the IA32_VMX_EPT_VPID_CAP
MSR (see Appendix A.10) is read as 0, indicating that the processor does not support accessed and dirty
flags for EPT.
— Reserved bits 11:7 and 63:N (where N is the processor’s physical-address width) must all be 0.
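The EPTP-field checks just listed can be sketched as an illustrative predicate; the default supported memory types and the parameter names are assumptions for the example:

```python
def valid_eptp(eptp, maxphyaddr, supported_memtypes=(0, 6),
               supports_ad_flags=True):
    """Sketch of the EPTP checks: memory type, page-walk length,
    accessed/dirty enable, and reserved bits (illustrative)."""
    if (eptp & 0x7) not in supported_memtypes:       # bits 2:0: memory type
        return False
    if (eptp >> 3) & 0x7 != 3:                       # bits 5:3: walk length - 1
        return False
    if (eptp >> 6) & 1 and not supports_ad_flags:    # bit 6: A/D flags
        return False
    if eptp & 0xF80:                                 # reserved bits 11:7
        return False
    if eptp >> maxphyaddr:                           # reserved bits 63:N
        return False
    return True
```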
• If the “enable PML” VM-execution control is 1, the “enable EPT” VM-execution control must also be 1.5 In
addition, the PML address must satisfy the following checks:
— Bits 11:0 of the address must be 0.
— The address should not set any bits beyond the processor’s physical-address width.
• If either the “unrestricted guest” VM-execution control or the “mode-based execute control for EPT” VM-
execution control is 1, the “enable EPT” VM-execution control must also be 1.6
• If the “sub-page write permissions for EPT” VM-execution control is 1, the “enable EPT” VM-execution control
must also be 1.7 In addition, the SPPTP VM-execution control field (see Table 24-10 in Section 24.6.21) must
satisfy the following checks:
— Bits 11:0 of the address must be 0.
1. “Process posted interrupts” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-exe-
cution controls is 0, VM entry functions as if the “process posted interrupts” VM-execution control were 0. See Section 24.6.2.
2. If IA32_VMX_BASIC[48] is read as 1, this address must not set any bits in the range 63:32; see Appendix A.1.
3. “Enable VPID” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-execution controls
is 0, VM entry functions as if the “enable VPID” VM-execution control were 0. See Section 24.6.2.
4. “Enable EPT” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-execution controls
is 0, VM entry functions as if the “enable EPT” VM-execution control were 0. See Section 24.6.2.
5. “Enable PML” and “enable EPT” are both secondary processor-based VM-execution controls. If bit 31 of the primary processor-based
VM-execution controls is 0, VM entry functions as if both these controls were 0. See Section 24.6.2.
6. All these controls are secondary processor-based VM-execution controls. If bit 31 of the primary processor-based VM-execution con-
trols is 0, VM entry functions as if all these controls were 0. See Section 24.6.2.
7. “Sub-page write permissions for EPT” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based
VM-execution controls is 0, VM entry functions as if the “sub-page write permissions for EPT” VM-execution control were 0. See Sec-
tion 24.6.2.
— The address should not set any bits beyond the processor’s physical-address width.
• If the “enable VM functions” processor-based VM-execution control is 1, reserved bits in the VM-function
controls must be clear.1 Software may consult the VMX capability MSRs to determine which bits are reserved
(see Appendix A.11). In addition, the following check is performed based on the setting of bits in the VM-
function controls (see Section 24.6.14):
— If “EPTP switching” VM-function control is 1, the “enable EPT” VM-execution control must also be 1. In
addition, the EPTP-list address must satisfy the following checks:
• Bits 11:0 of the address must be 0.
• The address must not set any bits beyond the processor’s physical-address width.
If the “enable VM functions” processor-based VM-execution control is 0, no checks are performed on the VM-
function controls.
• If the “VMCS shadowing” VM-execution control is 1, the VMREAD-bitmap and VMWRITE-bitmap addresses
must each satisfy the following checks:2
— Bits 11:0 of the address must be 0.
— The address must not set any bits beyond the processor’s physical-address width.
• If the “EPT-violation #VE” VM-execution control is 1, the virtualization-exception information address must
satisfy the following checks:3
— Bits 11:0 of the address must be 0.
— The address must not set any bits beyond the processor’s physical-address width.
• If the logical processor is operating with Intel PT enabled (if IA32_RTIT_CTL.TraceEn = 1) at the time of
VM entry, the “load IA32_RTIT_CTL” VM-entry control must be 0.
• If the “Intel PT uses guest physical addresses” VM-execution control is 1, the following controls must also be 1:
the “enable EPT” VM-execution control; the “load IA32_RTIT_CTL” VM-entry control; and the “clear
IA32_RTIT_CTL” VM-exit control.4
• If the “use TSC scaling” VM-execution control is 1, the TSC-multiplier must not be zero.5
1. “Enable VM functions” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-execution
controls is 0, VM entry functions as if the “enable VM functions” VM-execution control were 0. See Section 24.6.2.
2. “VMCS shadowing” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-execution
controls is 0, VM entry functions as if the “VMCS shadowing” VM-execution control were 0. See Section 24.6.2.
3. “EPT-violation #VE” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-execution
controls is 0, VM entry functions as if the “EPT-violation #VE” VM-execution control were 0. See Section 24.6.2.
4. “Intel PT uses guest physical addresses” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-
based VM-execution controls is 0, VM entry functions as if the “Intel PT uses guest physical addresses” VM-execution control were
0. See Section 24.6.2.
5. “Use TSC scaling” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-execution
controls is 0, VM entry functions as if the "use TSC scaling" VM-execution control were 0. See Section 24.6.2.
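The alignment and width checks that recur above can be sketched as a small helper. This is an illustrative sketch, not code defined by this manual; the function and parameter names are assumptions, and `maxphyaddr` stands for the width reported by CPUID.80000008H:EAX[7:0].

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative check for a 4-KByte aligned physical address referenced by
 * a VMCS field (e.g., the EPTP-list address or the virtualization-exception
 * information address). maxphyaddr is the processor's physical-address
 * width from CPUID.80000008H:EAX[7:0]. */
static bool valid_vmcs_phys_addr(uint64_t addr, unsigned maxphyaddr)
{
    if (addr & 0xFFFULL)                      /* bits 11:0 must be 0 */
        return false;
    if (maxphyaddr < 64 && (addr >> maxphyaddr) != 0)
        return false;                         /* no bits beyond the width */
    return true;
}
```

The same shape applies to every 4-KByte aligned address checked in this section; only the field being validated changes.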
— The address of the last byte in the VM-exit MSR-store area should not set any bits beyond the processor’s
physical-address width. The address of this last byte is VM-exit MSR-store address + (MSR count * 16) – 1.
(The arithmetic used for the computation uses more bits than the processor’s physical-address width.)
If IA32_VMX_BASIC[48] is read as 1, neither address should set any bits in the range 63:32; see Appendix
A.1.
• The following checks are performed for the VM-exit MSR-load address if the VM-exit MSR-load count field is
non-zero:
— The lower 4 bits of the VM-exit MSR-load address must be 0. The address should not set any bits beyond
the processor’s physical-address width.
— The address of the last byte in the VM-exit MSR-load area should not set any bits beyond the processor’s
physical-address width. The address of this last byte is VM-exit MSR-load address + (MSR count * 16) – 1.
(The arithmetic used for the computation uses more bits than the processor’s physical-address width.)
If IA32_VMX_BASIC[48] is read as 1, neither address should set any bits in the range 63:32; see Appendix
A.1.
6. Software can determine a processor’s physical-address width by executing CPUID with 80000008H in EAX. The physical-address
width is returned in bits 7:0 of EAX.
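The MSR-area checks above can be sketched as follows. This assumes a non-zero MSR count (the checks are performed only in that case); the function name is illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative check for a VM-exit/VM-entry MSR area: the base must be
 * 16-byte aligned, and neither the base nor the last byte
 * (base + count * 16 - 1) may set bits beyond the physical-address width.
 * The 64-bit arithmetic here is wider than any supported physical-address
 * width, matching the manual's note about the computation. */
static bool valid_msr_area(uint64_t base, uint32_t msr_count, unsigned maxphyaddr)
{
    if (base & 0xFULL)                        /* lower 4 bits must be 0 */
        return false;
    uint64_t last = base + (uint64_t)msr_count * 16 - 1;
    uint64_t beyond = (maxphyaddr < 64) ? ~((1ULL << maxphyaddr) - 1) : 0;
    return (base & beyond) == 0 && (last & beyond) == 0;
}
```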
— The address of the last byte in the VM-entry MSR-load area should not set any bits beyond the processor’s
physical-address width. The address of this last byte is VM-entry MSR-load address + (MSR count * 16) –
1. (The arithmetic used for the computation uses more bits than the processor’s physical-address width.)
If IA32_VMX_BASIC[48] is read as 1, neither address should set any bits in the range 63:32; see Appendix
A.1.
• If the processor is not in SMM, the “entry to SMM” and “deactivate dual-monitor treatment” VM-entry controls
must be 0.
• The “entry to SMM” and “deactivate dual-monitor treatment” VM-entry controls cannot both be 1.
1. Software can determine a processor’s physical-address width by executing CPUID with 80000008H in EAX. The physical-address
width is returned in bits 7:0 of EAX.
1. The bits corresponding to CR0.NW (bit 29) and CR0.CD (bit 30) are never checked because the values of these bits are not changed
by VM exit; see Section 27.5.1.
2. Software can determine a processor’s physical-address width by executing CPUID with 80000008H in EAX. The physical-address
width is returned in bits 7:0 of EAX.
3. Bit 63 of the CR3 field in the host-state area must be 0. This is true even though, if CR4.PCIDE = 1, bit 63 of the source operand to
MOV to CR3 is used to determine whether cached translation information is invalidated.
• The CR0 field must not set any bit to a value not supported in VMX operation (see Section 23.8). The following
are exceptions:
— Bit 0 (corresponding to CR0.PE) and bit 31 (PG) are not checked if the “unrestricted guest” VM-execution
control is 1.1
— Bit 29 (corresponding to CR0.NW) and bit 30 (CD) are never checked because the values of these bits are
not changed by VM entry; see Section 26.3.2.1.
• If bit 31 in the CR0 field (corresponding to PG) is 1, bit 0 in that field (PE) must also be 1.2
• The CR4 field must not set any bit to a value not supported in VMX operation (see Section 23.8).
• If bit 23 in the CR4 field (corresponding to CET) is 1, bit 16 in the CR0 field (WP) must also be 1.
• If the “load debug controls” VM-entry control is 1, bits reserved in the IA32_DEBUGCTL MSR must be 0 in the
field for that register. The first processors to support the virtual-machine extensions supported only the 1-
setting of this control and thus performed this check unconditionally.
• The following checks are performed on processors that support Intel 64 architecture:
— If the “IA-32e mode guest” VM-entry control is 1, bit 31 in the CR0 field (corresponding to CR0.PG) and
bit 5 in the CR4 field (corresponding to CR4.PAE) must each be 1.3
— If the “IA-32e mode guest” VM-entry control is 0, bit 17 in the CR4 field (corresponding to CR4.PCIDE)
must be 0.
— The CR3 field must be such that bits 63:52 and bits in the range 51:32 beyond the processor’s physical-
address width are 0.4,5
— If the “load debug controls” VM-entry control is 1, bits 63:32 in the DR7 field must be 0. The first
processors to support the virtual-machine extensions supported only the 1-setting of this control and thus
performed this check unconditionally (if they supported Intel 64 architecture).
— The IA32_SYSENTER_ESP field and the IA32_SYSENTER_EIP field must each contain a canonical address.
— If the “load CET state” VM-entry control is 1, the IA32_S_CET field and the
IA32_INTERRUPT_SSP_TABLE_ADDR field must contain canonical addresses.
• If the “load IA32_PERF_GLOBAL_CTRL” VM-entry control is 1, bits reserved in the IA32_PERF_GLOBAL_CTRL
MSR must be 0 in the field for that register (see Figure 18-3).
• If the “load IA32_PAT” VM-entry control is 1, the value of the field for the IA32_PAT MSR must be one that could
be written by WRMSR without fault at CPL 0. Specifically, each of the 8 bytes in the field must have one of the
values 0 (UC), 1 (WC), 4 (WT), 5 (WP), 6 (WB), or 7 (UC-).
• If the “load IA32_EFER” VM-entry control is 1, the following checks are performed on the field for the
IA32_EFER MSR:
— Bits reserved in the IA32_EFER MSR must be 0.
— Bit 10 (corresponding to IA32_EFER.LMA) must equal the value of the “IA-32e mode guest” VM-entry
control. It must also be identical to bit 8 (LME) if bit 31 in the CR0 field (corresponding to CR0.PG) is 1.6
1. “Unrestricted guest” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-execution
controls is 0, VM entry functions as if the “unrestricted guest” VM-execution control were 0. See Section 24.6.2.
2. If the capability MSR IA32_VMX_CR0_FIXED0 reports that CR0.PE must be 1 in VMX operation, bit 0 in the CR0 field must be 1
unless the “unrestricted guest” VM-execution control and bit 31 of the primary processor-based VM-execution controls are both 1.
3. If the capability MSR IA32_VMX_CR0_FIXED0 reports that CR0.PG must be 1 in VMX operation, bit 31 in the CR0 field must be 1
unless the “unrestricted guest” VM-execution control and bit 31 of the primary processor-based VM-execution controls are both 1.
4. Software can determine a processor’s physical-address width by executing CPUID with 80000008H in EAX. The physical-address
width is returned in bits 7:0 of EAX.
5. Bit 63 of the CR3 field in the guest-state area must be 0. This is true even though, if CR4.PCIDE = 1, bit 63 of the source operand to
MOV to CR3 is used to determine whether cached translation information is invalidated.
6. If the capability MSR IA32_VMX_CR0_FIXED0 reports that CR0.PG must be 1 in VMX operation, bit 31 in the CR0 field must be 1
unless the “unrestricted guest” VM-execution control and bit 31 of the primary processor-based VM-execution controls are both 1.
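Two of the checks above lend themselves to short sketches: the fixed-bit test on the CR0 and CR4 fields (against IA32_VMX_CR0_FIXED0/FIXED1-style capability values) and the IA32_PAT byte validation. Both functions are illustrative sketches, not an API defined by this manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* A CR value is supported in VMX operation if every bit that the FIXED0
 * capability MSR reports as fixed to 1 is set, and every bit that is
 * clear in the FIXED1 capability MSR (fixed to 0) is clear. */
static bool cr_valid_in_vmx(uint64_t cr, uint64_t fixed0, uint64_t fixed1)
{
    return (cr & fixed0) == fixed0 && (cr & ~fixed1) == 0;
}

/* Each of the 8 bytes of an IA32_PAT value must be 0 (UC), 1 (WC),
 * 4 (WT), 5 (WP), 6 (WB), or 7 (UC-); 2, 3, and values above 7 would
 * fault if written by WRMSR at CPL 0. */
static bool valid_pat(uint64_t pat)
{
    for (int i = 0; i < 8; i++) {
        uint8_t type = (uint8_t)(pat >> (i * 8));
        if (type == 2 || type == 3 || type > 7)
            return false;
    }
    return true;
}
```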
• If the “load IA32_BNDCFGS” VM-entry control is 1, the following checks are performed on the field for the
IA32_BNDCFGS MSR:
— Bits reserved in the IA32_BNDCFGS MSR must be 0.
— The linear address in bits 63:12 must be canonical.
• If the “load IA32_RTIT_CTL” VM-entry control is 1, bits reserved in the IA32_RTIT_CTL MSR must be 0 in the
field for that register (see Table 35-6).
• If the “load CET state” VM-entry control is 1, the IA32_S_CET field must not set any bits reserved in the
IA32_S_CET MSR, and bit 10 (corresponding to SUPPRESS) and bit 11 (TRACKER) of the field cannot both be
set.
1. “Unrestricted guest” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-execution
controls is 0, VM entry functions as if the “unrestricted guest” VM-execution control were 0. See Section 24.6.2.
• If the guest will not be virtual-8086, the different sub-fields are considered separately:
— Bits 3:0 (Type).
• CS. The values allowed depend on the setting of the “unrestricted guest” VM-execution
control:
— If the control is 0, the Type must be 9, 11, 13, or 15 (accessed code segment).
— If the control is 1, the Type must be either 3 (read/write accessed expand-up data
segment) or one of 9, 11, 13, and 15 (accessed code segment).
• SS. If SS is usable, the Type must be 3 or 7 (read/write, accessed data segment).
• DS, ES, FS, GS. The following checks apply if the register is usable:
— Bit 0 of the Type must be 1 (accessed).
— If bit 3 of the Type is 1 (code segment), then bit 1 of the Type must be 1 (readable).
— Bit 4 (S). If the register is CS or if the register is usable, S must be 1.
— Bits 6:5 (DPL).
• CS.
— If the Type is 3 (read/write accessed expand-up data segment), the DPL must be 0. The
Type can be 3 only if the “unrestricted guest” VM-execution control is 1.
— If the Type is 9 or 11 (non-conforming code segment), the DPL must equal the DPL in the
access-rights field for SS.
— If the Type is 13 or 15 (conforming code segment), the DPL cannot be greater than the
DPL in the access-rights field for SS.
• SS.
— If the “unrestricted guest” VM-execution control is 0, the DPL must equal the RPL from the
selector field.
— The DPL must be 0 either if the Type in the access-rights field for CS is 3 (read/write
accessed expand-up data segment) or if bit 0 in the CR0 field (corresponding to CR0.PE) is
0.1
• DS, ES, FS, GS. The DPL cannot be less than the RPL in the selector field if (1) the
“unrestricted guest” VM-execution control is 0; (2) the register is usable; and (3) the Type in
the access-rights field is in the range 0 – 11 (data segment or non-conforming code segment).
— Bit 7 (P). If the register is CS or if the register is usable, P must be 1.
— Bits 11:8 (reserved). If the register is CS or if the register is usable, these bits must all be 0.
— Bit 14 (D/B). For CS, D/B must be 0 if the guest will be IA-32e mode and the L bit (bit 13) in the
access-rights field is 1.
— Bit 15 (G). The following checks apply if the register is CS or if the register is usable:
• If any bit in the limit field in the range 11:0 is 0, G must be 0.
• If any bit in the limit field in the range 31:20 is 1, G must be 1.
— Bits 31:17 (reserved). If the register is CS or if the register is usable, these bits must all be 0.
— TR. The different sub-fields are considered separately:
• Bits 3:0 (Type).
— If the guest will not be IA-32e mode, the Type must be 3 (16-bit busy TSS) or 11 (32-bit busy
TSS).
1. The following apply if either the “unrestricted guest” VM-execution control or bit 31 of the primary processor-based VM-execution
controls is 0: (1) bit 0 in the CR0 field must be 1 if the capability MSR IA32_VMX_CR0_FIXED0 reports that CR0.PE must be 1 in
VMX operation; and (2) the Type in the access-rights field for CS cannot be 3.
— If the guest will be IA-32e mode, the Type must be 11 (64-bit busy TSS).
• Bit 4 (S). S must be 0.
• Bit 7 (P). P must be 1.
• Bits 11:8 (reserved). These bits must all be 0.
• Bit 15 (G).
— If any bit in the limit field in the range 11:0 is 0, G must be 0.
— If any bit in the limit field in the range 31:20 is 1, G must be 1.
• Bit 16 (Unusable). The unusable bit must be 0.
• Bits 31:17 (reserved). These bits must all be 0.
— LDTR. The following checks on the different sub-fields apply only if LDTR is usable:
• Bits 3:0 (Type). The Type must be 2 (LDT).
• Bit 4 (S). S must be 0.
• Bit 7 (P). P must be 1.
• Bits 11:8 (reserved). These bits must all be 0.
• Bit 15 (G).
— If any bit in the limit field in the range 11:0 is 0, G must be 0.
— If any bit in the limit field in the range 31:20 is 1, G must be 1.
• Bits 31:17 (reserved). These bits must all be 0.
1. Software can determine the number N by executing CPUID with 80000008H in EAX. The number of linear-address bits supported is
returned in bits 15:8 of EAX.
2. If the capability MSR IA32_VMX_CR0_FIXED0 reports that CR0.PE must be 1 in VMX operation, bit 0 in the CR0 field must be 1
unless the “unrestricted guest” VM-execution control and bit 31 of the primary processor-based VM-execution controls are both 1.
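The granularity (G) consistency check, stated identically above for CS/SS/DS/ES/FS/GS, TR, and LDTR, can be sketched as a single helper (the name is an assumption):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of the G-bit/limit consistency check: if any bit
 * in limit[11:0] is 0, G must be 0; if any bit in limit[31:20] is 1,
 * G must be 1. */
static bool valid_limit_g(uint32_t limit, bool g)
{
    if ((limit & 0xFFFu) != 0xFFFu && g)
        return false;                  /* some bit in 11:0 is 0, yet G = 1 */
    if ((limit & 0xFFF00000u) != 0 && !g)
        return false;                  /* some bit in 31:20 is 1, yet G = 0 */
    return true;
}
```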
• SSP. If the “load CET state” VM-entry control is 1, the following checks are performed on processors that
support Intel 64 architecture:
— Bits 63:32 must be 0 if the “IA-32e mode guest” VM-entry control is 0 or if the L bit (bit 13) in the access-
rights field for CS is 0.
— If the processor supports N < 64 linear-address bits, bits 63:N must be identical if the “IA-32e mode guest”
VM-entry control is 1 and the L bit in the access-rights field for CS is 1. (No check applies if the processor
supports 64 linear-address bits.)
1. As noted in Section 24.4.1, SS.DPL corresponds to the logical processor’s current privilege level (CPL).
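The canonicality requirement used throughout this section (bits 63:N identical, i.e., copies of bit N−1) amounts to a sign-extension test. A sketch, assuming N < 64 and a compiler that performs arithmetic right shifts on signed values (true of GCC and Clang):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative canonicality test for an address with n linear-address
 * bits (e.g., n = 48 with 4-level paging): sign-extend bit n-1 through
 * bit 63 and compare with the original value. */
static bool is_canonical(uint64_t addr, unsigned n)
{
    uint64_t shifted = addr << (64 - n);
    int64_t sext = (int64_t)shifted >> (64 - n);   /* arithmetic shift */
    return (uint64_t)sext == addr;
}
```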
— Bit 3 (blocking by NMI) must be 0 if the “virtual NMIs” VM-execution control is 1, the valid bit (bit 31) in the
VM-entry interruption-information field is 1, and the interruption type (bits 10:8) in that field has value 2
(indicating NMI).
— If bit 4 (enclave interruption) is 1, bit 1 (blocking by MOV-SS) must be 0 and the processor must support
SGX, enumerating CPUID.(EAX=07H,ECX=0):EBX.SGX[bit 2] as 1.
NOTE
If the “virtual NMIs” VM-execution control is 0, there is no requirement that bit 3 be 0 if the valid bit
in the VM-entry interruption-information field is 1 and the interruption type in that field has value 2.
1. Software can determine a processor’s physical-address width by executing CPUID with 80000008H in EAX. The physical-address
width is returned in bits 7:0 of EAX.
2. If IA32_VMX_BASIC[48] is read as 1, this field must not set any bits in the range 63:32; see Appendix A.1.
3. Earlier versions of this manual specified that the VMCS revision identifier was a 32-bit field. For all processors produced prior to this
change, bit 31 of the VMCS revision identifier was 0.
4. “VMCS shadowing” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-execution
controls is 0, VM entry functions as if the “VMCS shadowing” VM-execution control were 0. See Section 24.6.2.
1. On processors that support Intel 64 architecture, the physical-address extension may support more than 36 physical-address bits.
Software can determine the number of physical-address bits supported by executing CPUID with 80000008H in EAX. The physical-
address width is returned in bits 7:0 of EAX.
2. “Enable EPT” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-execution controls
is 0, VM entry functions as if the “enable EPT” VM-execution control were 0. See Section 24.6.2.
3. This implies that (1) bits 11:9 in each PDPTE are ignored; and (2) if bit 0 (present) is clear in one of the PDPTEs, bits 63:1 of that
PDPTE are ignored.
4. Bits 15:6, bit 17, and bits 28:19 of CR0 and CR0.ET are unchanged by executions of MOV to CR0. Bits 15:6, bit 17, and bits 28:19 of
CR0 are always 0 and CR0.ET is always 1.
• If the “load debug controls” VM-entry control is 1, DR7 is loaded from the DR7 field with the exception that
bit 12 and bits 15:14 are always 0 and bit 10 is always 1. The values of these bits in the DR7 field are ignored.
The first processors to support the virtual-machine extensions supported only the 1-setting of the “load
debug controls” VM-entry control and thus always loaded DR7 from the DR7 field.
• The following describes how certain MSRs are loaded using fields in the guest-state area:
— If the “load debug controls” VM-entry control is 1, the IA32_DEBUGCTL MSR is loaded from the
IA32_DEBUGCTL field. The first processors to support the virtual-machine extensions supported only the 1-
setting of this control and thus always loaded the IA32_DEBUGCTL MSR from the IA32_DEBUGCTL field.
— The IA32_SYSENTER_CS MSR is loaded from the IA32_SYSENTER_CS field. Since this field has only 32 bits,
bits 63:32 of the MSR are cleared to 0.
— The IA32_SYSENTER_ESP and IA32_SYSENTER_EIP MSRs are loaded from the IA32_SYSENTER_ESP field
and the IA32_SYSENTER_EIP field, respectively. On processors that do not support Intel 64 architecture,
these fields have only 32 bits; bits 63:32 of the MSRs are cleared to 0.
— The following are performed on processors that support Intel 64 architecture:
• The MSRs FS.base and GS.base are loaded from the base-address fields for FS and GS, respectively
(see Section 26.3.2.2).
• If the “load IA32_EFER” VM-entry control is 0, bits in the IA32_EFER MSR are modified as follows:
— IA32_EFER.LMA is loaded with the setting of the “IA-32e mode guest” VM-entry control.
— If CR0 is being loaded so that CR0.PG = 1, IA32_EFER.LME is also loaded with the setting of the
“IA-32e mode guest” VM-entry control.1 Otherwise, IA32_EFER.LME is unmodified.
See below for the case in which the “load IA32_EFER” VM-entry control is 1.
— If the “load IA32_PERF_GLOBAL_CTRL” VM-entry control is 1, the IA32_PERF_GLOBAL_CTRL MSR is loaded
from the IA32_PERF_GLOBAL_CTRL field.
— If the “load IA32_PAT” VM-entry control is 1, the IA32_PAT MSR is loaded from the IA32_PAT field.
— If the “load IA32_EFER” VM-entry control is 1, the IA32_EFER MSR is loaded from the IA32_EFER field.
— If the “load IA32_BNDCFGS” VM-entry control is 1, the IA32_BNDCFGS MSR is loaded from the
IA32_BNDCFGS field.
— If the “load IA32_RTIT_CTL” VM-entry control is 1, the IA32_RTIT_CTL MSR is loaded from the
IA32_RTIT_CTL field.
— If the “load CET” VM-entry control is 1, the IA32_S_CET and IA32_INTERRUPT_SSP_TABLE_ADDR MSRs
are loaded from the IA32_S_CET field and the IA32_INTERRUPT_SSP_TABLE_ADDR field, respectively. On
processors that do not support Intel 64 architecture, these fields have only 32 bits; bits 63:32 of the MSRs
are cleared to 0.
With the exception of FS.base and GS.base, any of these MSRs is subsequently overwritten if it appears in the
VM-entry MSR-load area. See Section 26.4.
• The SMBASE register is unmodified by all VM entries except those that return from SMM.
1. If the capability MSR IA32_VMX_CR0_FIXED0 reports that CR0.PG must be 1 in VMX operation, VM entry must be loading CR0 so
that CR0.PG = 1 unless the “unrestricted guest” VM-execution control and bit 31 of the primary processor-based VM-execution con-
trols are both 1.
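The IA32_EFER adjustment described above for the case in which the “load IA32_EFER” VM-entry control is 0 can be sketched as follows (the function and parameter names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch: on VM entry with "load IA32_EFER" = 0, LMA
 * (bit 10) is loaded from the "IA-32e mode guest" control; LME (bit 8)
 * follows the control only if CR0.PG is being loaded as 1, and is
 * otherwise unmodified. */
static uint64_t entry_load_efer(uint64_t efer, bool ia32e_guest, bool cr0_pg)
{
    efer = (efer & ~(1ULL << 10)) | ((uint64_t)ia32e_guest << 10);
    if (cr0_pg)
        efer = (efer & ~(1ULL << 8)) | ((uint64_t)ia32e_guest << 8);
    return efer;
}
```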
— If this bit is set for LDTR, uses of LDTR cause general-protection exceptions in all modes, just as they would
had LDTR been loaded using a null selector.
If this bit is clear for any of CS, SS, DS, ES, FS, GS, TR, and LDTR, a null selector value does not cause a fault
(general-protection exception or stack-fault exception).
• TR. The selector, base, limit, and access-rights fields are loaded.
• CS.
— The following fields are always loaded: selector, base address, limit, and (from the access-rights field) the
L, D, and G bits.
— For the other fields, the unusable bit of the access-rights field is consulted:
• If the unusable bit is 0, all of the access-rights field is loaded.
• If the unusable bit is 1, the remainder of CS access rights are undefined after VM entry.
• SS, DS, ES, FS, GS, and LDTR.
— The selector fields are loaded.
— For the other fields, the unusable bit of the corresponding access-rights field is consulted:
• If the unusable bit is 0, the base-address, limit, and access-rights fields are loaded.
• If the unusable bit is 1, the base address, the segment limit, and the remainder of the access rights are
undefined after VM entry with the following exceptions:
— Bits 3:0 of the base address for SS are cleared to 0.
— SS.DPL is always loaded from the SS access-rights field. This will be the current privilege level
(CPL) after the VM entry completes.
— SS.B is always set to 1.
— The base addresses for FS and GS are loaded from the corresponding fields in the VMCS. On
processors that support Intel 64 architecture, the values loaded for base addresses for FS and GS
are also manifest in the FS.base and GS.base MSRs.
— On processors that support Intel 64 architecture, the base address for LDTR is set to an undefined
but canonical value.
— On processors that support Intel 64 architecture, bits 63:32 of the base addresses for SS, DS, and
ES are cleared to 0.
GDTR and IDTR are loaded using the base and limit fields.
• If the control is 0, the PDPTEs are loaded from the page-directory-pointer table referenced by the physical
address in the value of CR3 being loaded by the VM entry (see Section 26.3.2.1). The values loaded are treated
as physical addresses in VMX non-root operation.
• If the control is 1, the PDPTEs are loaded from corresponding fields in the guest-state area (see Section
24.4.2). The values loaded are treated as guest-physical addresses in VMX non-root operation.
1. Because attempts to modify the value of IA32_EFER.LMA by WRMSR are ignored, attempts to modify it using the VM-entry MSR-
load area are also ignored.
The VM entry fails if processing fails for any entry. The logical processor responds to such failures by loading state
from the host-state area, as it would for a VM exit. See Section 26.8.
If any MSR is being loaded in such a way that would architecturally require a TLB flush, the TLBs are updated so
that, after VM entry, the logical processor will not use any translations that were cached before the transition.
2. If CR0.PG = 1, WRMSR to the IA32_EFER MSR causes a general-protection exception if it would modify the LME bit. If VM entry has
established CR0.PG = 1, the IA32_EFER MSR should not be included in the VM-entry MSR-load area for the purpose of modifying the
LME bit.
1. This does not imply that injection of an exception or interrupt will cause a VM exit due to the settings of VM-execution control fields
(such as the exception bitmap) that would cause a VM exit if the event had occurred in VMX non-root operation. In contrast, a
nested exception encountered during event delivery may cause a VM exit; see Section 26.6.1.1.
depending on the nature of the event whose delivery encountered the nested exception. See Chapter 6,
“Interrupt 8—Double Fault Exception (#DF)” in Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 3A.1
• If the bit for the nested exception is 1, a VM exit occurs. Section 26.6.1.2 details cases in which event injection
causes a VM exit.
1. Hardware exceptions with the following unused vectors are considered benign: 15 and 21–31. A hardware exception with vector 20
is considered benign unless the processor supports the 1-setting of the “EPT-violation #VE” VM-execution control; in that case, it
has the same severity as page faults.
2. While these items refer to RIP, the width of the value pushed (16 bits, 32 bits, or 64 bits) is determined normally.
software interrupt to cause such an exception. Thus, in this case, the injection invokes a protected-mode
interrupt handler independent of the value of RFLAGS.IOPL.
Injection of events of other types is not subject to this redirection.
• If VM entry is injecting a software interrupt (not redirected as described above) or software exception, privilege
checking is performed on the IDT descriptor being accessed as would be the case for executions of INT n, INT3,
or INTO (the descriptor’s DPL cannot be less than CPL). There is no checking of RFLAGS.IOPL, even if the guest
will be in virtual-8086 mode. Failure of this check may lead to a nested exception. Injection of an event with
interruption type external interrupt, NMI, hardware exception, or privileged software exception, or with
interruption type software interrupt that is redirected as described above, does not perform these checks.
• If VM entry is injecting a non-maskable interrupt (NMI) and the “virtual NMIs” VM-execution control is 1,
virtual-NMI blocking is in effect after VM entry.
• The transition causes a last-branch record to be logged if the LBR bit is set in the IA32_DEBUGCTL MSR. This is
true even for events such as debug exceptions, which normally clear the LBR bit before delivery.
• The last-exception record MSRs (LERs) may be updated based on the setting of the LBR bit in the
IA32_DEBUGCTL MSR. Events such as debug exceptions, which normally clear the LBR bit before they are
delivered, and therefore do not normally update the LERs, may do so as part of VM-entry event injection.
• If injection of an event encounters a nested exception, the value of the EXT bit (bit 0) in any error code for that
nested exception is determined as follows:
— If the event being injected has interruption type external interrupt, NMI, hardware exception, or privileged
software exception and encounters a nested exception (but does not produce a double fault), the error code
for that exception sets the EXT bit.
— If the event being injected is a software interrupt or a software exception and encounters a nested exception,
the error code for that exception clears the EXT bit.
— If event delivery encounters a nested exception and delivery of that exception encounters another
exception (but does not produce a double fault), the error code for that exception sets the EXT bit.
— If a double fault is produced, the error code for the double fault is 0000H (the EXT bit is clear).
1. “Virtualize APIC accesses” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-execu-
tion controls is 0, VM entry functions as if the “virtualize APIC accesses” VM-execution control were 0. See Section 24.6.2.
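The EXT-bit rules above reduce to a small decision function. The interruption-type values follow the VM-entry interruption-information encoding; the function itself is an illustrative sketch, not an interface defined by this manual.

```c
#include <stdbool.h>

/* Interruption types from the VM-entry interruption-information field. */
enum intr_type {
    TYPE_EXT_INT = 0, TYPE_NMI = 2, TYPE_HW_EXC = 3,
    TYPE_SW_INT = 4, TYPE_PRIV_SW_EXC = 5, TYPE_SW_EXC = 6
};

/* Returns the EXT bit (bit 0) of the error code for a nested exception
 * encountered while injecting an event of the given type. A double
 * fault's error code is 0000H, so EXT is clear in that case. */
static unsigned nested_error_code_ext(enum intr_type injected, bool double_fault)
{
    if (double_fault)
        return 0;
    switch (injected) {
    case TYPE_EXT_INT:
    case TYPE_NMI:
    case TYPE_HW_EXC:
    case TYPE_PRIV_SW_EXC:
        return 1;
    default:            /* software interrupt or software exception */
        return 0;
    }
}
```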
1. If the capability MSR IA32_VMX_CR0_FIXED0 reports that CR0.PE must be 1 in VMX operation, VM entry must be loading CR0.PE
with 1 unless the “unrestricted guest” VM-execution control and bit 31 of the primary processor-based VM-execution controls are
both 1.
• The bit’s value does not affect the blocking of NMIs after VM entry. NMIs are not blocked in VMX non-
root operation (except for ordinary blocking for other reasons, such as by the MOV SS instruction, the
wait-for-SIPI state, etc.).
• The bit’s value determines whether there is virtual-NMI blocking after VM entry. If the bit is 1, virtual-
NMI blocking is in effect after VM entry. If the bit is 0, there is no virtual-NMI blocking after VM entry
unless the VM entry is injecting an NMI (see Section 26.6.1.1). Execution of IRET removes virtual-NMI
blocking (even if the instruction generates a fault).
If the “NMI exiting” VM-execution control is 0, the “virtual NMIs” control must be 0; see Section 26.2.1.1.
• Blocking of system-management interrupts (SMIs) is determined as follows:
— If the VM entry was not executed in system-management mode (SMM), SMI blocking is unchanged by
VM entry.
— If the VM entry was executed in SMM, SMIs are blocked after VM entry if and only if bit 2 in the inter-
ruptibility-state field is 1.
1. A logical processor is in SMX operation if GETSEC[SEXIT] has not been executed since the last execution of GETSEC[SENTER]. See
Chapter 6, “Safer Mode Extensions Reference,” in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B.
• The interruptibility-state field does not indicate blocking by MOV SS and the VM entry is vectoring with either of
the following interruption types: software interrupt or software exception.
• The VM entry is not vectoring and the activity-state field indicates either shutdown or wait-for-SIPI.
If none of the above hold, the pending debug exceptions field specifies the debug exceptions that are pending for
the guest. There are valid pending debug exceptions if either the BS bit (bit 14) or the enable-breakpoint bit
(bit 12) is 1. If there are valid pending debug exceptions, they are handled as follows:
• If the VM entry is not vectoring, the pending debug exceptions are treated as they would be had they been
encountered normally in guest execution:
— If the logical processor is not blocking such exceptions (the interruptibility-state field indicates no blocking
by MOV SS), a debug exception is delivered after VM entry (see below).
— If the logical processor is blocking such exceptions (due to blocking by MOV SS), the pending debug
exceptions are held pending or lost as would normally be the case.
• If the VM entry is vectoring (with interruption type software interrupt or software exception and with blocking
by MOV SS), the following items apply:
— For injection of a software interrupt or of a software exception with vector 3 (#BP) or vector 4 (#OF) — or
a privileged software exception with vector 1 (#DB) — the pending debug exceptions are treated as they
would be had they been encountered normally in guest execution if the corresponding instruction (INT1,
INT3, or INTO) were executed after a MOV SS that encountered a debug trap.
— For injection of a software exception with a vector other than 3 and 4, the pending debug exceptions may
be lost or they may be delivered after injection (see below).
If there are no valid pending debug exceptions (as defined above), no pending debug exceptions are delivered after
VM entry.
If a pending debug exception is delivered after VM entry, it has the priority of “traps on the previous instruction”
(see Section 6.9 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). Thus, INIT
signals and system-management interrupts (SMIs) take priority over such an exception, as do VM exits induced by
the TPR threshold (see Section 26.7.7) and pending MTF VM exits (see Section 26.7.8). The exception takes priority
over any pending non-maskable interrupt (NMI) or external interrupt and also over VM exits due to the 1-settings
of the “interrupt-window exiting” and “NMI-window exiting” VM-execution controls.
A pending debug exception delivered after VM entry causes a VM exit if bit 1 (#DB) is 1 in the exception
bitmap. If it does not cause a VM exit, it updates DR6 normally.
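The validity and delivery rules above can be sketched in C. This is an illustrative model only, not VMM code; the bit positions (bit 12 = enable-breakpoint and bit 14 = BS in the pending debug exceptions field, bit 1 = #DB in the exception bitmap) come from the text, while the helper names are hypothetical.

```c
#include <stdint.h>

/* Illustrative model only, not VMM code. */
#define PDE_ENABLE_BP  (1ULL << 12)   /* enable-breakpoint bit */
#define PDE_BS         (1ULL << 14)   /* single-step (BS) bit */

/* Valid pending debug exceptions exist if BS or enable-breakpoint is 1. */
static int pde_valid(uint64_t pending_dbg)
{
    return (pending_dbg & (PDE_BS | PDE_ENABLE_BP)) != 0;
}

/* A #DB delivered after VM entry causes a VM exit iff exception-bitmap
 * bit 1 is set; otherwise DR6 is updated normally. */
static int delivered_db_causes_vmexit(uint32_t exception_bitmap)
{
    return (exception_bitmap >> 1) & 1;
}
```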
26-24 Vol. 3C
VM ENTRIES
• These events occur after any event injection specified for VM entry.
• Non-maskable interrupts (NMIs) and higher priority events take priority over these events. These events take
priority over external interrupts and lower priority events.
• These events wake the logical processor if it just entered the HLT state because of a VM entry (see Section
26.7.2). They do not occur if the logical processor just entered the shutdown state or the wait-for-SIPI state.
1. “Virtualize APIC accesses” and “virtual-interrupt delivery” are secondary processor-based VM-execution controls. If bit 31 of the pri-
mary processor-based VM-execution controls is 0, VM entry functions as if these controls were 0. See Section 24.6.2.
1. A logical processor is in SMX operation if GETSEC[SEXIT] has not been executed since the last execution of GETSEC[SENTER]. A logi-
cal processor is outside SMX operation if GETSEC[SENTER] has not been executed or if GETSEC[SEXIT] was executed after the last
execution of GETSEC[SENTER]. See Chapter 6, “Safer Mode Extensions Reference,” in Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2B.
22. Updates to Chapter 27, Volume 3C
Change bars and green text show changes to Chapter 27 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3C: System Programming Guide, Part 3.
------------------------------------------------------------------------------------------
Changes to chapter: Minor updates to Section 27.1 “Architectural State Before a VM Exit”. Added Control-flow
Enforcement Technology details.
VM exits occur in response to certain instructions and events in VMX non-root operation as detailed in Section 25.1
through Section 25.2. VM exits perform the following operations:
1. Information about the cause of the VM exit is recorded in the VM-exit information fields, and VM-entry control
fields are modified as described in Section 27.2.
2. Processor state is saved in the guest-state area (Section 27.3).
3. MSRs may be saved in the VM-exit MSR-store area (Section 27.4). This step is not performed for SMM VM exits
that activate the dual-monitor treatment of SMIs and SMM.
4. The following may be performed in parallel and in any order (Section 27.5):
— Processor state is loaded based in part on the host-state area and some VM-exit controls. This step is not
performed for SMM VM exits that activate the dual-monitor treatment of SMIs and SMM. See Section
34.15.6 for information on how processor state is loaded by such VM exits.
— Address-range monitoring is cleared.
5. MSRs may be loaded from the VM-exit MSR-load area (Section 27.6). This step is not performed for SMM
VM exits that activate the dual-monitor treatment of SMIs and SMM.
VM exits are not logged with last-branch records, do not produce branch-trace messages, and do not update the
branch-trace store.
Section 27.1 clarifies the nature of the architectural state before a VM exit begins. The steps described above are
detailed in Section 27.2 through Section 27.6.
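The ordering of the five operations above, including the steps skipped for SMM VM exits that activate the dual-monitor treatment, can be sketched as follows. The step names are hypothetical stand-ins, not a real VMM API; a log string records the order so the skip rule is visible.

```c
#include <string.h>

static char vmexit_log[128];

static void step(const char *name)
{
    strcat(vmexit_log, name);
    strcat(vmexit_log, " ");
}

/* smm_dual_monitor != 0 models an SMM VM exit that activates the
 * dual-monitor treatment of SMIs and SMM. */
static const char *do_vm_exit(int smm_dual_monitor)
{
    vmexit_log[0] = '\0';
    step("record_exit_info");                        /* step 1 */
    step("save_guest_state");                        /* step 2 */
    if (!smm_dual_monitor) step("msr_store");        /* step 3: skipped for SMM */
    if (!smm_dual_monitor) step("load_host_state");  /* step 4 (with below) */
    step("clear_monitor");                           /* step 4, in any order */
    if (!smm_dual_monitor) step("msr_load");         /* step 5: skipped for SMM */
    return vmexit_log;
}
```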
Section 34.15 describes the dual-monitor treatment of system-management interrupts (SMIs) and system-
management mode (SMM). Under this treatment, ordinary transitions to SMM are replaced by VM exits to a sepa-
rate SMM monitor. Called SMM VM exits, these are caused by the arrival of an SMI or the execution of VMCALL in
VMX root operation. SMM VM exits differ from other VM exits in ways that are detailed in Section 34.15.2.
VM EXITS
— An NMI causes subsequent NMIs to be blocked, but only after the VM exit completes.
— An external interrupt does not acknowledge the interrupt controller and the interrupt remains pending,
unless the “acknowledge interrupt on exit” VM-exit control is 1. In such a case, the interrupt controller is
acknowledged and the interrupt is no longer pending.
— The flags L0 – L3 in DR7 (bit 0, bit 2, bit 4, and bit 6) are not cleared when a task switch causes a VM exit.
— If a task switch causes a VM exit, none of the following are modified by the task switch: old task-state
segment (TSS); new TSS; old TSS descriptor; new TSS descriptor; RFLAGS.NT1; or the TR register.
— No last-exception record is made if the event that would do so directly causes a VM exit.
— If a machine-check exception causes a VM exit directly, this does not prevent machine-check MSRs from
being updated. These are updated by the machine-check event itself and not the resulting machine-check
exception.
— If the logical processor is in an inactive state (see Section 24.4.2) and not executing instructions, some
events may be blocked but others may return the logical processor to the active state. Unblocked events
may cause VM exits.2 If an unblocked event causes a VM exit directly, a return to the active state occurs
only after the VM exit completes.3 The VM exit generates any special bus cycle that is normally generated
when the active state is entered from that activity state.
MTF VM exits (see Section 25.5.2 and Section 26.7.8) are not blocked in the HLT activity state. If an MTF
VM exit occurs in the HLT activity state, the logical processor returns to the active state only after the
VM exit completes. MTF VM exits are blocked in the shutdown state and the wait-for-SIPI state.
• If an event causes a VM exit indirectly, the event does update architectural state:
— A debug exception updates DR6, DR7, and the IA32_DEBUGCTL MSR. No debug exceptions are considered
pending.
— A page fault updates CR2.
— An NMI causes subsequent NMIs to be blocked before the VM exit commences.
— An external interrupt acknowledges the interrupt controller and the interrupt is no longer pending.
— If the logical processor had been in an inactive state, it enters the active state and, before the VM exit
commences, generates any special bus cycle that is normally generated when the active state is entered
from that activity state.
— There is no blocking by STI or by MOV SS when the VM exit commences.
— Processor state that is normally updated as part of delivery through the IDT (CS, RIP, SS, RSP, RFLAGS) is
not modified. However, the incomplete delivery of the event may write to the stack.
— The treatment of last-exception records is implementation dependent:
• Some processors make a last-exception record when beginning the delivery of an event through the IDT
(before it can encounter a nested exception). Such processors perform this update even if the event
encounters a nested exception that causes a VM exit (including the case where nested exceptions lead
to a triple fault).
• Other processors delay making a last-exception record until event delivery has reached some event
handler successfully (perhaps after one or more nested exceptions). Such processors do not update the
last-exception record if a VM exit or triple fault occurs before an event handler is reached.
1. This chapter uses the notation RAX, RIP, RSP, RFLAGS, etc. for processor registers because most processors that support VMX oper-
ation also support Intel 64 architecture. For processors that do not support Intel 64 architecture, this notation refers to the 32-bit
forms of those registers (EAX, EIP, ESP, EFLAGS, etc.). In a few places, notation such as EAX is used to refer specifically to lower 32
bits of the indicated register.
2. If a VM exit takes the processor from an inactive state resulting from execution of a specific instruction (HLT or MWAIT), the value
saved for RIP by that VM exit will reference the following instruction.
3. An exception is made if the logical processor had been inactive due to execution of MWAIT; in this case, it is considered to have
become active before the VM exit.
• If the “virtual NMIs” VM-execution control is 1, VM entry injects an NMI, and delivery of the NMI causes a
nested exception, double fault, task switch, EPT violation, EPT misconfiguration, page-modification log-full
event, or SPP-related event, or APIC access that causes a VM exit, virtual-NMI blocking is in effect before the
VM exit commences.
• If a VM exit results from a fault, EPT violation, EPT misconfiguration, page-modification log-full event, or SPP-
related event that is encountered during execution of IRET and the “NMI exiting” VM-execution control is 0, any
blocking by NMI is cleared before the VM exit commences. However, the previous state of blocking by NMI may
be recorded in the exit qualification or in the VM-exit interruption-information field; see Section 27.2.3.
• If a VM exit results from a fault, EPT violation, EPT misconfiguration, page-modification log-full event, or SPP-
related event that is encountered during execution of IRET and the “virtual NMIs” VM-execution control is 1,
virtual-NMI blocking is cleared before the VM exit commences. However, the previous state of blocking by NMI
may be recorded in the exit qualification or in the VM-exit interruption-information field; see Section 27.2.3.
• Suppose that a VM exit is caused directly by an x87 FPU Floating-Point Error (#MF) or by any of the following
events if the event was unblocked due to (and given priority over) an x87 FPU Floating-Point Error: an INIT
signal, an external interrupt, an NMI, an SMI, or a machine-check exception. In these cases, there is no
blocking by STI or by MOV SS when the VM exit commences.
• Normally, a last-branch record may be made when an event is delivered through the IDT. However, if such an
event results in a VM exit before delivery is complete, no last-branch record is made.
• If a machine-check exception results in a VM exit, processor state is suspect, and suspect state may be saved
to the guest-state area. A VM monitor should consult the RIPV and EIPV bits in the
IA32_MCG_STATUS MSR before resuming a guest that caused a VM exit resulting from a machine-check
exception.
• If a VM exit results from a fault, APIC access (see Section 29.4), EPT violation, EPT misconfiguration, page-
modification log-full event, or SPP-related event that is encountered while executing an instruction, data
breakpoints due to that instruction may have been recognized and information about them may be saved in the
pending debug exceptions field (unless the VM exit clears that field; see Section 27.3.4).
• The following VM exits are considered to happen after an instruction is executed:
— VM exits resulting from debug traps (single-step, I/O breakpoints, and data breakpoints).
— VM exits resulting from debug exceptions (data breakpoints) whose recognition was delayed by blocking by
MOV SS.
— VM exits resulting from some machine-check exceptions.
— Trap-like VM exits due to execution of MOV to CR8 when the “CR8-load exiting” VM-execution control is 0
and the “use TPR shadow” VM-execution control is 1 (see Section 29.3). (Such VM exits can occur only from
64-bit mode and thus only on processors that support Intel 64 architecture.)
— Trap-like VM exits due to execution of WRMSR when the “use MSR bitmaps” VM-execution control is 1; the
value of ECX is in the range 800H–8FFH; and the bit corresponding to the ECX value in write bitmap for low
MSRs is 0; and the “virtualize x2APIC mode” VM-execution control is 1. See Section 29.5.
— VM exits caused by APIC-write emulation (see Section 29.4.3.2) that result from APIC accesses as part of
instruction execution.
For these VM exits, the instruction’s modifications to architectural state complete before the VM exit occurs.
Such modifications include those to the logical processor’s interruptibility state (see Table 24-3). If there had
been blocking by MOV SS, POP SS, or STI before the instruction executed, such blocking is no longer in effect.
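The conditions for the trap-like WRMSR VM exit described above can be expressed as a simple predicate. The function and parameter names are hypothetical; each parameter stands in for the corresponding VM-execution control or MSR-bitmap bit named in the text.

```c
/* Hypothetical predicate modeling the trap-like WRMSR VM-exit condition. */
static int trap_like_wrmsr_exit(int use_msr_bitmaps, unsigned int ecx,
                                int write_bitmap_bit_for_ecx,
                                int virtualize_x2apic_mode)
{
    return use_msr_bitmaps
        && ecx >= 0x800 && ecx <= 0x8FF     /* ECX in 800H-8FFH */
        && write_bitmap_bit_for_ecx == 0    /* write bitmap for low MSRs */
        && virtualize_x2apic_mode;
}
```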
A VM exit that occurs in enclave mode sets bit 27 of the exit-reason field and bit 4 of the guest interruptibility-state
field. Before such a VM exit is delivered, an Asynchronous Enclave Exit (AEX) occurs (see Chapter 39, “Enclave
Exiting Events”). An AEX modifies architectural state (Section 39.3). In particular, the processor establishes the
following architectural state as indicated:
• The following bits in RFLAGS are cleared: CF, PF, AF, ZF, SF, OF, and RF.
• FS and GS are restored to the values they had prior to the most recent enclave entry.
• RIP is loaded with the AEP of interrupted enclave thread.
• RSP is loaded from the URSP field in the enclave’s state-save area (SSA).
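A minimal C sketch of the AEX state changes just listed. The structure and field names are hypothetical; the RFLAGS mask encodes CF, PF, AF, ZF, SF, OF, and RF at their architectural bit positions.

```c
#include <stdint.h>

/* Hypothetical structure; only the state changes listed above are modeled. */
struct gpr_state {
    uint64_t rflags, rip, rsp, fs_base, gs_base;
};

/* CF|PF|AF|ZF|SF|OF|RF at their architectural bit positions. */
#define RFLAGS_AEX_CLEAR_MASK 0x000108D5ULL

static void aex_fixup(struct gpr_state *s, uint64_t aep, uint64_t ssa_ursp,
                      uint64_t saved_fs, uint64_t saved_gs)
{
    s->rflags &= ~RFLAGS_AEX_CLEAR_MASK; /* clear CF, PF, AF, ZF, SF, OF, RF */
    s->fs_base = saved_fs;               /* FS as before enclave entry */
    s->gs_base = saved_gs;               /* GS as before enclave entry */
    s->rip = aep;                        /* AEP of the interrupted thread */
    s->rsp = ssa_ursp;                   /* URSP field of the SSA */
}
```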
3:0 B3 – B0. When set, each of these bits indicates that the corresponding breakpoint condition was met. Any of
these bits may be set even if its corresponding enabling bit in DR7 is not set.
13 BD. When set, this bit indicates that the cause of the debug exception is “debug register access detected.”
14 BS. When set, this bit indicates that the cause of the debug exception is either the execution of a single
instruction (if RFLAGS.TF = 1 and IA32_DEBUGCTL.BTF = 0) or a taken branch (if
RFLAGS.TF = DEBUGCTL.BTF = 1).
1. Bit 5 of the IA32_VMX_MISC MSR is read as 1 on any logical processor that supports the 1-setting of the “unrestricted guest” VM-
execution control.
2. Bit 31 of this field is set on certain VM-entry failures; see Section 26.8.
16 RTM. When set, this bit indicates that a debug exception (#DB) or a breakpoint exception (#BP) occurred
inside an RTM region while advanced debugging of RTM transactional regions was enabled (see Section
16.3.7, “RTM-Enabled Debugger Support,” of the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 1).1
63:17 Reserved (cleared to 0). Bits 63:32 exist only on processors that support Intel 64 architecture.
NOTES:
1. In general, the format of this field matches that of DR6. However, DR6 clears bit 16 to indicate an RTM-related exception, while this
field sets the bit to indicate that condition.
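The bit positions described above can be captured as masks. This is a direct transcription of the text; the macro names are hypothetical.

```c
#include <stdint.h>

/* DR6-format bits of the pending debug exceptions field (names hypothetical). */
#define DBG_B0   (1ULL << 0)
#define DBG_B1   (1ULL << 1)
#define DBG_B2   (1ULL << 2)
#define DBG_B3   (1ULL << 3)
#define DBG_BD   (1ULL << 13)   /* debug register access detected */
#define DBG_BS   (1ULL << 14)   /* single step or taken branch */
#define DBG_RTM  (1ULL << 16)   /* set (unlike DR6) for an RTM-related event */

/* Breakpoint condition n (0..3) was met, regardless of its DR7 enable bit. */
static int breakpoint_hit(uint64_t field, int n)
{
    return (int)((field >> n) & 1);
}
```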
— For a page-fault exception, the exit qualification contains the linear address that caused the page fault. On
processors that support Intel 64 architecture, bits 63:32 are cleared if the logical processor was not in 64-
bit mode before the VM exit.
If the page-fault exception occurred during execution of an instruction in enclave mode (and not during
delivery of an event incident to enclave mode), bits 11:0 of the exit qualification are cleared.
— For a start-up IPI (SIPI), the exit qualification contains the SIPI vector information in bits 7:0. Bits 63:8 of
the exit qualification are cleared to 0.
— For a task switch, the exit qualification contains details about the task switch, encoded as shown in
Table 27-2.
— For INVLPG, the exit qualification contains the linear-address operand of the instruction.
• On processors that support Intel 64 architecture, bits 63:32 are cleared if the logical processor was not
in 64-bit mode before the VM exit.
• If the INVLPG source operand specifies an unusable segment, the linear address specified in the exit
qualification will match the linear address that the INVLPG would have used if no VM exit occurred. This
address is not architecturally defined and may be implementation-specific.
15:0 Selector of task-state segment (TSS) to which the guest attempted to switch
63:32 Reserved (cleared to 0). These bits exist only on processors that support Intel 64 architecture.
— For INVEPT, INVPCID, INVVPID, LGDT, LIDT, LLDT, LTR, SGDT, SIDT, SLDT, STR, VMCLEAR, VMPTRLD,
VMPTRST, VMREAD, VMWRITE, VMXON, XRSTORS, and XSAVES, the exit qualification receives the value of
the instruction’s displacement field, which is sign-extended to 64 bits if necessary (32 bits on processors
that do not support Intel 64 architecture). If the instruction has no displacement (for example, has a
register operand), zero is stored into the exit qualification.
On processors that support Intel 64 architecture, an exception is made for RIP-relative addressing (used
only in 64-bit mode). Such addressing causes an instruction to use an address that is the sum of the
displacement field and the value of RIP that references the following instruction. In this case, the exit
qualification is loaded with the sum of the displacement field and the appropriate RIP value.
In all cases, bits of this field beyond the instruction’s address size are undefined. For example, suppose
that the address-size field in the VM-exit instruction-information field (see Section 24.9.4 and Section
27.2.5) reports an n-bit address size. Then bits 63:n (bits 31:n on processors that do not support Intel 64
architecture) of the instruction displacement are undefined.
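The displacement rule, including the RIP-relative case, can be sketched as follows. The function and parameter names are hypothetical; masking the bits beyond the address size is a modeling choice, since the architecture leaves those bits undefined rather than zero.

```c
#include <stdint.h>

/* Hypothetical helper modeling the exit-qualification value for
 * displacement-based memory operands. */
static uint64_t exit_qual_for_displacement(int64_t disp, int rip_relative,
                                           uint64_t next_rip,
                                           int addr_size_bits)
{
    /* RIP-relative addressing (64-bit mode only): displacement plus the
     * RIP that references the following instruction. */
    uint64_t q = rip_relative ? (uint64_t)disp + next_rip : (uint64_t)disp;
    if (addr_size_bits < 64)
        q &= (1ULL << addr_size_bits) - 1;  /* bits beyond size undefined */
    return q;
}
```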
— For a control-register access, the exit qualification contains information about the access and has the
format given in Table 27-3.
— For MOV DR, the exit qualification contains information about the instruction and has the format given in
Table 27-4.
— For an I/O instruction, the exit qualification contains information about the instruction and has the format
given in Table 27-5.
— For MWAIT, the exit qualification contains a value that indicates whether address-range monitoring
hardware was armed. The exit qualification is set either to 0 (if address-range monitoring hardware is not
armed) or to 1 (if address-range monitoring hardware is armed).
— For an APIC-access VM exit resulting from a linear access or a guest-physical access to the APIC-access
page (see Section 29.4), the exit qualification contains information about the access and has the format
given in Table 27-6.1
If the access to the APIC-access page occurred during execution of an instruction in enclave mode (and not
during delivery of an event incident to enclave mode), bits 11:0 of the exit qualification are cleared.
Such a VM exit that sets bits 15:12 of the exit qualification to 0000b (data read during instruction execution)
or 0001b (data write during instruction execution) sets bit 12 (which distinguishes data read from data write)
to the value that would have been stored in bit 1 (W/R) of the page-fault error code had the access caused a
page fault instead of an APIC-access VM exit. This implies the following:
• For an APIC-access VM exit caused by the CLFLUSH and CLFLUSHOPT instructions, the access type is
“data read during instruction execution.”
• For an APIC-access VM exit caused by the ENTER instruction, the access type is “data write during
instruction execution.”
3:0 Number of control register (0 for CLTS and LMSW). Bit 3 is always 0 on processors that do not support Intel 64
architecture as they do not support CR8.
7 Reserved (cleared to 0)
1. The exit qualification is undefined if the access was part of the logging of a branch record or a processor-event-based-sampling
(PEBS) record to the DS save area. It is recommended that software configure the paging structures so that no address in the DS
save area translates to an address on the APIC-access page.
63:32 Reserved (cleared to 0). These bits exist only on processors that support Intel 64 architecture.
• For an APIC-access VM exit caused by the MASKMOVQ instruction or the MASKMOVDQU instruction, the
access type is “data write during instruction execution.”
• For an APIC-access VM exit caused by the MONITOR instruction, the access type is “data read during
instruction execution.”
Such a VM exit stores 1 in bit 31 of the IDT-vectoring information field (see Section 27.2.4) if and only if it
sets bits 15:12 of the exit qualification to 0011b (linear access during event delivery) or 1010b (guest-
physical access during event delivery).
See Section 29.4.4 for further discussion of these instructions and APIC-access VM exits.
For APIC-access VM exits resulting from physical accesses to the APIC-access page (see Section 29.4.6),
the exit qualification is undefined.
— For an EPT violation, the exit qualification contains information about the access causing the EPT violation
and has the format given in Table 27-7.
As noted in that table, the format and meaning of the exit qualification depends on the setting of the
“mode-based execute control for EPT” VM-execution control and whether the processor supports advanced
VM-exit information for EPT violations.1
An EPT violation that occurs as a result of execution of a read-modify-write operation sets bit 1 (data
write). Whether it also sets bit 0 (data read) is implementation-specific and, for a given implementation,
may differ for different kinds of read-modify-write operations.
3 Reserved (cleared to 0)
1. Software can determine whether advanced VM-exit information for EPT violations is supported by consulting the VMX capability
MSR IA32_VMX_EPT_VPID_CAP (see Appendix A.10).
63:32 Reserved (cleared to 0). These bits exist only on processors that support Intel 64 architecture.
Table 27-6. Exit Qualification for APIC-Access VM Exits from Linear Accesses and Guest-Physical Accesses
11:0 • If the APIC-access VM exit is due to a linear access, the offset of access within the APIC page.
• Undefined if the APIC-access VM exit is due to a guest-physical access.
16 If the APIC-access VM exit is due to a guest-physical access, this bit is set if the access was asynchronous to
instruction execution and not part of event delivery. (The bit is set if the access is related to trace output by
Intel PT; see Section 25.5.4.) Otherwise, this bit is cleared.
63:17 Reserved (cleared to 0). Bits 63:32 exist only on processors that support Intel 64 architecture.
0 Set if the access causing the EPT violation was a data read.1
1 Set if the access causing the EPT violation was a data write.1
2 Set if the access causing the EPT violation was an instruction fetch.
1. Execution of WRMSR with ECX = 83FH (self-IPI MSR) can lead to an APIC-write VM exit; the exit qualification for such an APIC-write
VM exit is 3F0H.
Vol. 3C 27-9
VM EXITS
3 The logical-AND of bit 0 in the EPT paging-structure entries used to translate the guest-physical address of the
access causing the EPT violation (indicates whether the guest-physical address was readable).2
4 The logical-AND of bit 1 in the EPT paging-structure entries used to translate the guest-physical address of the
access causing the EPT violation (indicates whether the guest-physical address was writeable).
5 The logical-AND of bit 2 in the EPT paging-structure entries used to translate the guest-physical address of the
access causing the EPT violation.
If the “mode-based execute control for EPT” VM-execution control is 0, this indicates whether the guest-physical
address was executable. If that control is 1, this indicates whether the guest-physical address was executable
for supervisor-mode linear addresses.
6 If the “mode-based execute control for EPT” VM-execution control is 0, the value of this bit is undefined. If that
control is 1, this bit is the logical-AND of bit 10 in the EPT paging-structure entries used to translate the guest-
physical address of the access causing the EPT violation. In this case, it indicates whether the guest-physical
address was executable for user-mode linear addresses.
8 If bit 7 is 1:
• Set if the access causing the EPT violation is to a guest-physical address that is the translation of a linear
address.
• Clear if the access causing the EPT violation is to a paging-structure entry as part of a page walk or the
update of an accessed or dirty bit.
Reserved if bit 7 is 0 (cleared to 0).
9 If bit 7 is 1, bit 8 is 1, and the processor supports advanced VM-exit information for EPT violations,3 this bit is 0
if the linear address is a supervisor-mode linear address and 1 if it is a user-mode linear address. (If CR0.PG = 0,
the translation of every linear address is a user-mode linear address and thus this bit will be 1.) Otherwise, this
bit is undefined.
10 If bit 7 is 1, bit 8 is 1, and the processor supports advanced VM-exit information for EPT violations,3 this bit is 0
if paging translates the linear address to a read-only page and 1 if it translates to a read/write page. (If CR0.PG =
0, every linear address is read/write and thus this bit will be 1.) Otherwise, this bit is undefined.
11 If bit 7 is 1, bit 8 is 1, and the processor supports advanced VM-exit information for EPT violations,3 this bit is 0
if paging translates the linear address to an executable page and 1 if it translates to an execute-disable page. (If
CR0.PG = 0, CR4.PAE = 0, or IA32_EFER.NXE = 0, every linear address is executable and thus this bit will be 0.)
Otherwise, this bit is undefined.
13 Set if the access causing the EPT violation was a shadow-stack access.
14 If supervisor shadow-stack control is enabled (by setting bit 7 of EPTP), this bit is the same as bit 60 in the EPT
paging-structure entry that maps the page of the guest-physical address of the access causing the EPT violation.
Otherwise (or if translation of the guest-physical address terminates before reaching an EPT paging-structure
entry that maps a page), this bit is undefined.
16 This bit is set if the access was asynchronous to instruction execution and not the result of event delivery. (The
bit is set if the access is related to trace output by Intel PT; see Section 25.5.4.) Otherwise, this bit is cleared.
NOTES:
1. If accessed and dirty flags for EPT are enabled, processor accesses to guest paging-structure entries are treated as writes with
regard to EPT violations (see Section 28.2.3.2). If such an access causes an EPT violation, the processor sets both bit 0 and bit 1 of
the exit qualification.
2. Bits 5:3 are cleared to 0 if any of EPT paging-structure entries used to translate the guest-physical address of the access causing the
EPT violation is not present (see Section 28.2.2).
3. Software can determine whether advanced VM-exit information for EPT violations is supported by consulting the VMX capability
MSR IA32_VMX_EPT_VPID_CAP (see Appendix A.10).
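A decoder sketch for the low bits of the EPT-violation exit qualification defined in Table 27-7. The structure and field names are hypothetical; bits whose meaning is conditional (for example, bit 8 applies only when bit 7 is 1) are extracted as-is without interpretation.

```c
#include <stdint.h>

/* Hypothetical structure; field names do not come from the VMCS. */
struct ept_violation_info {
    int data_read, data_write, insn_fetch;
    int gpa_readable, gpa_writeable, gpa_executable;
    int linear_valid, is_translation, shadow_stack;
};

static struct ept_violation_info decode_ept_qual(uint64_t q)
{
    struct ept_violation_info i;
    i.data_read      = (int)((q >> 0) & 1);
    i.data_write     = (int)((q >> 1) & 1);
    i.insn_fetch     = (int)((q >> 2) & 1);
    i.gpa_readable   = (int)((q >> 3) & 1);  /* AND of EPT-entry bit 0 */
    i.gpa_writeable  = (int)((q >> 4) & 1);  /* AND of EPT-entry bit 1 */
    i.gpa_executable = (int)((q >> 5) & 1);  /* AND of EPT-entry bit 2 */
    i.linear_valid   = (int)((q >> 7) & 1);  /* linear-address field valid */
    i.is_translation = (int)((q >> 8) & 1);  /* meaningful only if bit 7 = 1 */
    i.shadow_stack   = (int)((q >> 13) & 1); /* shadow-stack access */
    return i;
}
```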
— VM exits due to EPT violations that set bit 7 of the exit qualification (see Table 27-7; these are all EPT
violations except those resulting from an attempt to load the PDPTEs as part of execution of the MOV CR
instruction and those due to TAPT). The linear address may translate to the guest-physical address whose
access caused the EPT violation. Alternatively, translation of the linear address may reference a paging-
structure entry whose access caused the EPT violation. Bits 63:32 are cleared if the logical processor was
not in 64-bit mode before the VM exit.
If the EPT violation occurred during execution of an instruction in enclave mode (and not during delivery of
an event incident to enclave mode), bits 11:0 of this field are cleared.
— VM exits due to SPP-related events.
— For all other VM exits, the field is undefined.
• Guest-physical address. For a VM exit due to an EPT violation, an EPT misconfiguration, or an SPP-related
event, this field receives the guest-physical address that caused the EPT violation or EPT misconfiguration. For
all other VM exits, the field is undefined.
If the EPT violation or EPT misconfiguration occurred during execution of an instruction in enclave mode (and
not during delivery of an event incident to enclave mode), bits 11:0 of this field are cleared.
1. INT1 and INT3 refer to the instructions with opcodes F1 and CC, respectively, and not to INT n with value 1 or 3 for n.
— Bit 11 is set to 1 if the VM exit is caused by a hardware exception that would have delivered an error code
on the stack. This bit is always 0 if the VM exit occurred while the logical processor was in real-address
mode (CR0.PE=0).1 If bit 11 is set to 1, the error code is placed in the VM-exit interruption error code (see
below).
— Bit 12 reports “NMI unblocking due to IRET”; see Section 27.2.3. The value of this bit is undefined if the
VM exit is due to a double fault (the interruption type is hardware exception and the vector is 8).
— Bits 30:13 are always set to 0.
— Bit 31 is always set to 1.
For other VM exits (including those due to external interrupts when the “acknowledge interrupt on exit” VM-exit
control is 0), the field is marked invalid (by clearing bit 31) and the remainder of the field is undefined.
• VM-exit interruption error code.
— For VM exits that set both bit 31 (valid) and bit 11 (error code valid) in the VM-exit interruption-information
field, this field receives the error code that would have been pushed on the stack had the event causing the
VM exit been delivered normally through the IDT. The EXT bit is set in this field exactly when it would be set
normally. For exceptions that occur during the delivery of a double fault (if the IDT-vectoring information field
indicates a double fault), the EXT bit is set to 1, assuming (1) that the exception would produce an
error code normally (if not incident to double-fault delivery) and (2) that the error code uses the EXT bit
(not for page faults, which use a different format).
— For other VM exits, the value of this field is undefined.
1. If the capability MSR IA32_VMX_CR0_FIXED0 reports that CR0.PE must be 1 in VMX operation, a logical processor cannot be in real-
address mode unless the “unrestricted guest” VM-execution control and bit 31 of the primary processor-based VM-execution con-
trols are both 1.
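A sketch of parsing the VM-exit interruption-information field discussed above. The helper names are hypothetical; the vector in bits 7:0 and the interruption type in bits 10:8 follow the standard VMCS event-information layout, while bit 11 (error code valid) and bit 31 (valid) follow the text.

```c
#include <stdint.h>

#define INTR_INFO_VALID      (1U << 31)  /* field valid */
#define INTR_INFO_ERR_VALID  (1U << 11)  /* error code delivered on stack */

static int intr_info_vector(uint32_t info) { return (int)(info & 0xFF); }
static int intr_info_type(uint32_t info)   { return (int)((info >> 8) & 7); }

/* The VM-exit interruption error code is defined only when both the
 * valid bit and the error-code-valid bit are set. */
static int intr_info_has_error(uint32_t info)
{
    return (info & (INTR_INFO_VALID | INTR_INFO_ERR_VALID))
        == (INTR_INFO_VALID | INTR_INFO_ERR_VALID);
}
```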
1. This includes the case in which a VM exit occurs while delivering a software interrupt (INT n) through the 16-bit IVT (interrupt vec-
tor table) that is used in virtual-8086 mode with virtual-machine extensions (if RFLAGS.VM = CR4.VME = 1).
2. In the following items, INT1 and INT3 refer to the instructions with opcodes F1 and CC, respectively, and not to INT n with value 1 or
3 for n.
3. If the capability MSR IA32_VMX_CR0_FIXED0 reports that CR0.PE must be 1 in VMX operation, a logical processor cannot be in real-
address mode unless the “unrestricted guest” VM-execution control and bit 31 of the primary processor-based VM-execution con-
trols are both 1.
1. This item applies only to fault-like VM exits. It does not apply to trap-like VM exits following executions of the MOV to CR8 instruc-
tion when the “use TPR shadow” VM-execution control is 1 or to those following executions of the WRMSR instruction when the
“virtualize x2APIC mode” VM-execution control is 1.
2. The VM-exit instruction-length field is not defined following APIC-access VM exits resulting from physical accesses (see Section
29.4.6) even if encountered during delivery of a software interrupt, privileged software exception, or software exception.
The cases of VM exits encountered during delivery of a software interrupt, privileged software exception, or
software exception include those encountered during delivery of events injected as part of VM entry (see
Section 26.6.1.2). If the original event was injected as part of VM entry, this field receives the value of the VM-
entry instruction length.
All VM exits other than those listed in the above items leave this field undefined.
If the VM exit occurred in enclave mode, this field is cleared (none of the previous items apply).
Table 27-8. Format of the VM-Exit Instruction-Information Field as Used for INS and OUTS
Bit Position(s) Content
6:0 Undefined.
9:7 Address size:
0: 16-bit
1: 32-bit
2: 64-bit (used only on processors that support Intel 64 architecture)
Other values not used.
14:10 Undefined.
17:15 Segment register:
0: ES
1: CS
2: SS
3: DS
4: FS
5: GS
Other values not used. Undefined for VM exits due to execution of INS.
31:18 Undefined.
• VM-exit instruction information. For VM exits due to attempts to execute INS, INVEPT, INVPCID, INVVPID,
LIDT, LGDT, LLDT, LTR, OUTS, RDRAND, RDSEED, SIDT, SGDT, SLDT, STR, VMCLEAR, VMPTRLD, VMPTRST,
VMREAD, VMWRITE, VMXON, XRSTORS, or XSAVES, this field receives information about the instruction that
caused the VM exit. The format of the field depends on the identity of the instruction causing the VM exit:
— For VM exits due to attempts to execute INS or OUTS, the field has the format given in Table 27-8.1
— For VM exits due to attempts to execute INVEPT, INVPCID, or INVVPID, the field has the format given in
Table 27-9.
— For VM exits due to attempts to execute LIDT, LGDT, SIDT, or SGDT, the field has the format given in
Table 27-10.
— For VM exits due to attempts to execute LLDT, LTR, SLDT, or STR, the field has the format given in
Table 27-11.
— For VM exits due to attempts to execute RDRAND, RDSEED, TPAUSE, or UMWAIT, the field has the format
given in Table 27-12.
— For VM exits due to attempts to execute VMCLEAR, VMPTRLD, VMPTRST, VMXON, XRSTORS, or XSAVES,
the field has the format given in Table 27-13.
— For VM exits due to attempts to execute VMREAD or VMWRITE, the field has the format given in
Table 27-14.
For all other VM exits, the field is undefined, unless the VM exit occurred in enclave mode, in which case the
field is cleared.
1. The format of the field was undefined for these VM exits on the first processors to support the virtual-machine extensions. Soft-
ware can determine whether the format specified in Table 27-8 is used by consulting the VMX capability MSR IA32_VMX_BASIC
(see Appendix A.1).
• I/O RCX, I/O RSI, I/O RDI, I/O RIP. These fields are undefined except for SMM VM exits due to system-
management interrupts (SMIs) that arrive immediately after retirement of I/O instructions. See Section
34.15.2.3. Note that, if the VM exit occurred in enclave mode, these fields are all cleared.
Table 27-9. Format of the VM-Exit Instruction-Information Field as Used for INVEPT, INVPCID, and INVVPID
Bit Position(s) Content
1:0 Scaling:
0: no scaling
1: scale by 2
2: scale by 4
3: scale by 8 (used only on processors that support Intel 64 architecture)
Undefined for instructions with no index register (bit 22 is set).
6:2 Undefined.
9:7 Address size:
0: 16-bit
1: 32-bit
2: 64-bit (used only on processors that support Intel 64 architecture)
Other values not used.
10 Cleared to 0.
14:11 Undefined.
17:15 Segment register:
0: ES
1: CS
2: SS
3: DS
4: FS
5: GS
Other values not used.
21:18 IndexReg:
0 = RAX
1 = RCX
2 = RDX
3 = RBX
4 = RSP
5 = RBP
6 = RSI
7 = RDI
8–15 represent R8–R15, respectively (used only on processors that support Intel 64 architecture)
Undefined for instructions with no index register (bit 22 is set).
22 IndexReg invalid (0 = valid; 1 = invalid)
26:23 BaseReg (encoded as IndexReg above)
Undefined for memory instructions with no base register (bit 27 is set).
27 BaseReg invalid (0 = valid; 1 = invalid)
31:28 Reg2 (same encoding as IndexReg above)
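The memory-operand encoding in Table 27-9 (shared, with minor variations, by Tables 27-10 and 27-13) can be decoded as sketched below. The names are illustrative, not architectural; note that the invalid bits (22 and 27) must be consulted before the index and base fields are trusted:

```c
#include <stdint.h>

/* Illustrative decoder for the memory-operand layout of Table 27-9.
 * Register numbers follow the table (0 = RAX ... 15 = R15);
 * -1 marks a field the table defines as invalid/undefined. */
typedef struct {
    unsigned scale;     /* 1, 2, 4, or 8; meaningful only with a valid index */
    unsigned addr_size; /* bits 9:7 - 0: 16-bit, 1: 32-bit, 2: 64-bit */
    unsigned seg_reg;   /* bits 17:15 - 0: ES ... 5: GS */
    int index_reg;      /* bits 21:18, or -1 if bit 22 (invalid) is set */
    int base_reg;       /* bits 26:23, or -1 if bit 27 (invalid) is set */
    unsigned reg2;      /* bits 31:28 */
} mem_operand_info;

static mem_operand_info decode_mem_operand(uint32_t info)
{
    mem_operand_info d;
    d.scale     = 1u << (info & 0x3);                            /* bits 1:0 */
    d.addr_size = (info >> 7) & 0x7;
    d.seg_reg   = (info >> 15) & 0x7;
    d.index_reg = (info & (1u << 22)) ? -1 : (int)((info >> 18) & 0xF);
    d.base_reg  = (info & (1u << 27)) ? -1 : (int)((info >> 23) & 0xF);
    d.reg2      = (info >> 28) & 0xF;
    return d;
}
```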
Table 27-10. Format of the VM-Exit Instruction-Information Field as Used for LIDT, LGDT, SIDT, or SGDT
Bit Position(s) Content
1:0 Scaling:
0: no scaling
1: scale by 2
2: scale by 4
3: scale by 8 (used only on processors that support Intel 64 architecture)
Undefined for instructions with no index register (bit 22 is set).
6:2 Undefined.
9:7 Address size:
0: 16-bit
1: 32-bit
2: 64-bit (used only on processors that support Intel 64 architecture)
Other values not used.
10 Cleared to 0.
11 Operand size:
0: 16-bit
1: 32-bit
Undefined for VM exits from 64-bit mode.
14:12 Undefined.
17:15 Segment register:
0: ES
1: CS
2: SS
3: DS
4: FS
5: GS
Other values not used.
21:18 IndexReg:
0 = RAX
1 = RCX
2 = RDX
3 = RBX
4 = RSP
5 = RBP
6 = RSI
7 = RDI
8–15 represent R8–R15, respectively (used only on processors that support Intel 64 architecture)
Undefined for instructions with no index register (bit 22 is set).
22 IndexReg invalid (0 = valid; 1 = invalid)
26:23 BaseReg (encoded as IndexReg above)
Undefined for instructions with no base register (bit 27 is set).
27 BaseReg invalid (0 = valid; 1 = invalid)
29:28 Instruction identity:
0: SGDT
1: SIDT
2: LGDT
3: LIDT
31:30 Undefined.
Table 27-11. Format of the VM-Exit Instruction-Information Field as Used for LLDT, LTR, SLDT, and STR
Bit Position(s) Content
1:0 Scaling:
0: no scaling
1: scale by 2
2: scale by 4
3: scale by 8 (used only on processors that support Intel 64 architecture)
Undefined for register instructions (bit 10 is set) and for memory instructions with no index register (bit 10 is clear
and bit 22 is set).
2 Undefined.
6:3 Reg1:
0 = RAX
1 = RCX
2 = RDX
3 = RBX
4 = RSP
5 = RBP
6 = RSI
7 = RDI
8–15 represent R8–R15, respectively (used only on processors that support Intel 64 architecture)
Undefined for memory instructions (bit 10 is clear).
9:7 Address size:
0: 16-bit
1: 32-bit
2: 64-bit (used only on processors that support Intel 64 architecture)
Other values not used. Undefined for register instructions (bit 10 is set).
10 Mem/Reg (0 = memory; 1 = register).
14:11 Undefined.
17:15 Segment register:
0: ES
1: CS
2: SS
3: DS
4: FS
5: GS
Other values not used. Undefined for register instructions (bit 10 is set).
21:18 IndexReg (encoded as Reg1 above)
Undefined for register instructions (bit 10 is set) and for memory instructions with no index register (bit 10 is clear
and bit 22 is set).
22 IndexReg invalid (0 = valid; 1 = invalid)
Undefined for register instructions (bit 10 is set).
26:23 BaseReg (encoded as Reg1 above)
Undefined for register instructions (bit 10 is set) and for memory instructions with no base register (bit 10 is clear
and bit 27 is set).
27 BaseReg invalid (0 = valid; 1 = invalid)
Undefined for register instructions (bit 10 is set).
29:28 Instruction identity:
0: SLDT
1: STR
2: LLDT
3: LTR
31:30 Undefined.
Table 27-12. Format of the VM-Exit Instruction-Information Field as Used for RDRAND, RDSEED, TPAUSE, and
UMWAIT
Bit Position(s) Content
2:0 Undefined.
6:3 Operand register (destination for RDRAND and RDSEED; source for TPAUSE and UMWAIT):
0 = RAX
1 = RCX
2 = RDX
3 = RBX
4 = RSP
5 = RBP
6 = RSI
7 = RDI
8–15 represent R8–R15, respectively (used only on processors that support Intel 64 architecture)
10:7 Undefined.
12:11 Operand size:
0: 16-bit
1: 32-bit
2: 64-bit
The value 3 is not used.
31:13 Undefined.
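Table 27-12 is simpler than the memory-operand tables: only a register number and an operand size are reported. A minimal sketch (hypothetical names, bit positions from the table):

```c
#include <stdint.h>

/* Illustrative decoder for Table 27-12 (RDRAND, RDSEED, TPAUSE, UMWAIT). */
typedef struct {
    unsigned operand_reg;  /* bits 6:3 - 0 = RAX ... 15 = R15 */
    unsigned operand_size; /* bits 12:11 - 0: 16-bit, 1: 32-bit, 2: 64-bit */
} op_reg_info;

static op_reg_info decode_op_reg_info(uint32_t info)
{
    op_reg_info d;
    d.operand_reg  = (info >> 3) & 0xF;
    d.operand_size = (info >> 11) & 0x3;
    return d;
}
```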
Table 27-13. Format of the VM-Exit Instruction-Information Field as Used for VMCLEAR, VMPTRLD, VMPTRST,
VMXON, XRSTORS, and XSAVES
Bit Position(s) Content
1:0 Scaling:
0: no scaling
1: scale by 2
2: scale by 4
3: scale by 8 (used only on processors that support Intel 64 architecture)
Undefined for instructions with no index register (bit 22 is set).
6:2 Undefined.
9:7 Address size:
0: 16-bit
1: 32-bit
2: 64-bit (used only on processors that support Intel 64 architecture)
Other values not used.
10 Cleared to 0.
14:11 Undefined.
17:15 Segment register:
0: ES
1: CS
2: SS
3: DS
4: FS
5: GS
Other values not used.
21:18 IndexReg:
0 = RAX
1 = RCX
2 = RDX
3 = RBX
4 = RSP
5 = RBP
6 = RSI
7 = RDI
8–15 represent R8–R15, respectively (used only on processors that support Intel 64 architecture)
Undefined for instructions with no index register (bit 22 is set).
22 IndexReg invalid (0 = valid; 1 = invalid)
26:23 BaseReg (encoded as IndexReg above)
Undefined for instructions with no base register (bit 27 is set).
27 BaseReg invalid (0 = valid; 1 = invalid)
31:28 Undefined.
Table 27-14. Format of the VM-Exit Instruction-Information Field as Used for VMREAD and VMWRITE
Bit Position(s) Content
1:0 Scaling:
0: no scaling
1: scale by 2
2: scale by 4
3: scale by 8 (used only on processors that support Intel 64 architecture)
Undefined for register instructions (bit 10 is set) and for memory instructions with no index register (bit 10 is clear
and bit 22 is set).
2 Undefined.
6:3 Reg1:
0 = RAX
1 = RCX
2 = RDX
3 = RBX
4 = RSP
5 = RBP
6 = RSI
7 = RDI
8–15 represent R8–R15, respectively (used only on processors that support Intel 64 architecture)
Undefined for memory instructions (bit 10 is clear).
9:7 Address size:
0: 16-bit
1: 32-bit
2: 64-bit (used only on processors that support Intel 64 architecture)
Other values not used. Undefined for register instructions (bit 10 is set).
10 Mem/Reg (0 = memory; 1 = register).
14:11 Undefined.
17:15 Segment register:
0: ES
1: CS
2: SS
3: DS
4: FS
5: GS
Other values not used. Undefined for register instructions (bit 10 is set).
21:18 IndexReg (encoded as Reg1 above)
Undefined for register instructions (bit 10 is set) and for memory instructions with no index register (bit 10 is clear
and bit 22 is set).
22 IndexReg invalid (0 = valid; 1 = invalid)
Undefined for register instructions (bit 10 is set).
26:23 BaseReg (encoded as Reg1 above)
Undefined for register instructions (bit 10 is set) and for memory instructions with no base register (bit 10 is clear
and bit 27 is set).
27 BaseReg invalid (0 = valid; 1 = invalid)
Undefined for register instructions (bit 10 is set).
31:28 Reg2 (same encoding as Reg1 above)
1. The reference here is to the full value of RIP before any truncation that would occur had the stack width been only 32 bits or 16
bits.
2. The reference here is to the full value of RFLAGS before any truncation that would occur had the stack width been only 32 bits or
16 bits.
the event been delivered through a task gate) had the event been delivered through the IDT. See below for
VM exits due to task switches caused by task gates in the IDT.
— If the VM exit is caused by a triple fault, the value saved is that which the logical processor would have in
RF in the RFLAGS register had the triple fault taken the logical processor to the shutdown state.
— If the VM exit is caused by a task switch (including one caused by a task gate in the IDT), the value saved
is that which would have been saved in the RFLAGS image in the old task-state segment (TSS) had the task
switch completed normally without exception.
— If the VM exit is caused by an attempt to execute an instruction that unconditionally causes VM exits or one
that was configured to do so with a VM-execution control, the value saved is 0.1
— For APIC-access VM exits and for VM exits caused by EPT violations, EPT misconfigurations, page-modifi-
cation log-full events, or SPP-related events, the value saved depends on whether the VM exit occurred
during delivery of an event through the IDT:
• If the VM exit stored 0 for bit 31 of the IDT-vectoring information field (because the VM exit did not occur
during delivery of an event through the IDT; see Section 27.2.4), the value saved is 1.
• If the VM exit stored 1 for bit 31 of the IDT-vectoring information field (because the VM exit did occur
during delivery of an event through the IDT), the value saved is the value that would have appeared in
the saved RFLAGS image had the event been delivered through the IDT (see above).
— For all other VM exits, the value saved is the value RFLAGS.RF had before the VM exit occurred.
• If the processor supports the 1-setting of the “load CET” VM-entry control, the contents of the SSP register are
saved into the SSP field.
1. This is true even if RFLAGS.RF was 1 before the instruction was executed. If, in response to such a VM exit, a VM monitor re-enters
the guest to re-execute the instruction that caused the VM exit (for example, after clearing the VM-execution control that caused
the VM exit), the instruction may encounter a code breakpoint that has already been processed. A VM monitor can avoid this by set-
ting the guest value of RFLAGS.RF to 1 before resuming guest software.
2. If this activity state was an inactive state resulting from execution of a specific instruction (HLT or MWAIT), the value saved for RIP
by that VM exit will reference the following instruction.
— A VM exit with basic exit reason “TPR below threshold”,1 “virtualized EOI”, “APIC write”, or “monitor trap
flag.”
— A VM exit due to trace-address pre-translation (TAPT; see Section 25.5.4). Such VM exits can have basic
exit reason “APIC access,” “EPT violation,” “EPT misconfiguration,” “page-modification log full,” or “SPP-
related event.” When due to TAPT, these VM exits (with the exception of those due to EPT misconfigura-
tions) set bit 16 of the exit qualification, indicating that they are asynchronous to instruction execution and
not part of event delivery.
— VM exits that are not caused by debug exceptions and that occur while there is MOV-SS blocking of debug
exceptions.
For VM exits that do not clear the field, the value saved is determined as follows:
— Each of bits 3:0 may be set if it corresponds to a matched breakpoint. This may be true even if the corre-
sponding breakpoint is not enabled in DR7.
— Suppose that a VM exit is due to an INIT signal, a machine-check exception, or an SMI; or that a VM exit
has basic exit reason “TPR below threshold” or “monitor trap flag.” In this case, the value saved sets bits
corresponding to the causes of any debug exceptions that were pending at the time of the VM exit.
If the VM exit occurs immediately after VM entry, the value saved may match that which was loaded on
VM entry (see Section 26.7.3). Otherwise, the following items apply:
• Bit 12 (enabled breakpoint) is set to 1 in any of the following cases:
— If there was at least one matched data or I/O breakpoint that was enabled in DR7.
— If it had been set on VM entry, causing there to be valid pending debug exceptions (see Section
26.7.3) and the VM exit occurred before those exceptions were either delivered or lost.
— If the XBEGIN instruction was executed immediately before the VM exit and advanced debugging of
RTM transactional regions had been enabled (see Section 16.3.7, “RTM-Enabled Debugger
Support,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1). (This does
not apply to VM exits with basic exit reason “monitor trap flag.”)
In other cases, bit 12 is cleared to 0.
• Bit 14 (BS) is set if RFLAGS.TF = 1 in either of the following cases:
— IA32_DEBUGCTL.BTF = 0 and the cause of a pending debug exception was the execution of a single
instruction.
— IA32_DEBUGCTL.BTF = 1 and the cause of a pending debug exception was a taken branch.
• Bit 16 (RTM) is set if a debug exception (#DB) or a breakpoint exception (#BP) occurred inside an RTM
region while advanced debugging of RTM transactional regions had been enabled. (This does not apply
to VM exits with basic exit reason “monitor trap flag.”)
— Suppose that a VM exit is due to another reason (but not a debug exception) and occurs while there is MOV-
SS blocking of debug exceptions. In this case, the value saved sets bits corresponding to the causes of any
debug exceptions that were pending at the time of the VM exit. If the VM exit occurs immediately after
VM entry (no instructions were executed in VMX non-root operation), the value saved may match that
which was loaded on VM entry (see Section 26.7.3). Otherwise, the following items apply:
• Bit 12 (enabled breakpoint) is set to 1 if there was at least one matched data or I/O breakpoint that was
enabled in DR7. Bit 12 is also set if it had been set on VM entry, causing there to be valid pending debug
exceptions (see Section 26.7.3) and the VM exit occurred before those exceptions were either delivered
or lost. In other cases, bit 12 is cleared to 0.
• The setting of bit 14 (BS) is implementation-specific. However, it is not set if RFLAGS.TF = 0 or
IA32_DEBUGCTL.BTF = 1.
— The reserved bits in the field are cleared.
• If the “save VMX-preemption timer value” VM-exit control is 1, the value of the timer is saved into the VMX-
preemption timer-value field. This is the value loaded from this field on VM entry as subsequently decremented
(see Section 25.5.1). VM exits due to timer expiration save the value 0. Other VM exits may also save the value
0 if the timer expired during VM exit. (If the “save VMX-preemption timer value” VM-exit control is 0, VM exit
does not modify the value of the VMX-preemption timer-value field.)
1. This item includes VM exits that occur as a result of certain VM entries (Section 26.7.7).
• If the logical processor supports the 1-setting of the “enable EPT” VM-execution control, values are saved into
the four (4) PDPTE fields as follows:
— If the “enable EPT” VM-execution control is 1 and the logical processor was using PAE paging at the time of
the VM exit, the PDPTE values currently in use are saved:1
• The value saved into bits 11:9 of each of the fields is undefined.
• If the value saved into one of the fields has bit 0 (present) clear, the value saved into bits 63:1 of that
field is undefined. That value need not correspond to the value that was loaded by VM entry or to any
value that might have been loaded in VMX non-root operation.
• If the value saved into one of the fields has bit 0 (present) set, the value saved into bits 63:12 of the
field is a guest-physical address.
— If the “enable EPT” VM-execution control is 0 or the logical processor was not using PAE paging at the time
of the VM exit, the values saved are undefined.
1. A logical processor uses PAE paging if CR0.PG = 1, CR4.PAE = 1 and IA32_EFER.LMA = 0. See Section 4.4 in the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 3A. “Enable EPT” is a secondary processor-based VM-execution control. If bit 31
of the primary processor-based VM-execution controls is 0, VM exit functions as if the “enable EPT” VM-execution control were 0.
See Section 24.6.2.
On processors that support Intel 64 architecture, the full value of each 64-bit field loaded (for example, the base
address for GDTR) is loaded regardless of the mode of the logical processor before and after the VM exit.
The loading of host state is detailed in Section 27.5.1 to Section 27.5.5. These sections reference VMCS fields that
correspond to processor state. Unless otherwise stated, these references are to fields in the host-state area.
A logical processor is in IA-32e mode after a VM exit only if the “host address-space size” VM-exit control is 1. If
the logical processor was in IA-32e mode before the VM exit and this control is 0, a VMX abort occurs. See Section
27.7.
In addition to loading host state, VM exits clear address-range monitoring (Section 27.5.6).
After the state loading described in this section, VM exits may load MSRs from the VM-exit MSR-load area (see
Section 27.6). This loading occurs only after the state loading described in this section.
1. Bits 28:19, 17, and 15:6 of CR0 and CR0.ET are unchanged by executions of MOV to CR0. CR0.ET is always 1 and the other bits are
always 0.
2. Software can determine a processor’s physical-address width by executing CPUID with 80000008H in EAX. The physical-address
width is returned in bits 7:0 of EAX.
3. Software can determine the number N by executing CPUID with 80000008H in EAX. The number of linear-address bits supported is
returned in bits 15:8 of EAX.
— SS, DS, ES, FS, and GS. Undefined if the segment is unusable; otherwise, type set to 3 and S set to 1
(read/write, accessed, expand-up data segment).
— TR. Type set to 11 and S set to 0 (busy 32-bit task-state segment).
• The DPL is set as follows:
— CS, SS, and TR. Set to 0. The current privilege level (CPL) will be 0 after the VM exit completes.
— DS, ES, FS, and GS. Undefined if the segment is unusable; otherwise, set to 0.
• The P bit is set as follows:
— CS, TR. Set to 1.
— SS, DS, ES, FS, and GS. Undefined if the segment is unusable; otherwise, set to 1.
• On processors that support Intel 64 architecture, CS.L is loaded with the setting of the “host address-space
size” VM-exit control. Because the value of this control is also loaded into IA32_EFER.LMA (see Section 27.5.1),
no VM exit is ever to compatibility mode (which requires IA32_EFER.LMA = 1 and CS.L = 0).
• D/B.
— CS. Loaded with the inverse of the setting of the “host address-space size” VM-exit control. For example, if
that control is 0, indicating a 32-bit host, CS.D/B is set to 1.
— SS. Set to 1.
— DS, ES, FS, and GS. Undefined if the segment is unusable; otherwise, set to 1.
— TR. Set to 0.
• G.
— CS. Set to 1.
— SS, DS, ES, FS, and GS. Undefined if the segment is unusable; otherwise, set to 1.
— TR. Set to 0.
The host-state area does not contain a selector field for LDTR. LDTR is established as follows on all VM exits: the
selector is cleared to 0000H, the segment is marked unusable and is otherwise undefined (although the base
address is always canonical).
The base addresses for GDTR and IDTR are loaded from the GDTR base-address field and the IDTR base-address
field, respectively. If the processor supports the Intel 64 architecture and supports N < 64 linear-address bits,
each of bits 63:N of each base address is set to the value of bit N–1 of that base address. The GDTR
and IDTR limits are each set to FFFFH.
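The replication of bit N–1 into bits 63:N is ordinary sign extension from bit N–1. A minimal sketch (the helper name is illustrative; it assumes 1 ≤ N < 64 and that only the low N bits of the stored value are meaningful):

```c
#include <stdint.h>

/* Replicate bit n-1 of a base address into bits 63:n, as described
 * above for GDTR/IDTR base-address loading. Assumes 1 <= n < 64. */
static uint64_t canonicalize_base(uint64_t base, unsigned n)
{
    uint64_t sign = 1ULL << (n - 1);
    base &= (sign << 1) - 1;     /* keep only the low n bits */
    return (base ^ sign) - sign; /* sign-extend from bit n-1 */
}
```

For example, with 48 linear-address bits, a base with bit 47 set becomes a canonical high address, while one with bit 47 clear is unchanged.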
1. On processors that support Intel 64 architecture, the physical-address extension may support more than 36 physical-address bits.
Software can determine a processor’s physical-address width by executing CPUID with 80000008H in EAX. The physical-address
width is returned in bits 7:0 of EAX.
A VM exit is to a VMM that uses PAE paging if (1) bit 5 (corresponding to CR4.PAE) is set in the CR4 field in the host-
state area of the VMCS; and (2) the “host address-space size” VM-exit control is 0. Such a VM exit may check the
validity of the PDPTEs referenced by the CR3 field in the host-state area of the VMCS. Such a VM exit must check
their validity if either (1) PAE paging was not in use before the VM exit; or (2) the value of CR3 is changing as a
result of the VM exit. A VM exit to a VMM that does not use PAE paging must not check the validity of the PDPTEs.
A VM exit that checks the validity of the PDPTEs uses the same checks that are used when CR3 is loaded with
MOV to CR3 when PAE paging is in use. If MOV to CR3 would cause a general-protection exception due to the
PDPTEs that would be loaded (e.g., because a reserved bit is set), a VMX abort occurs (see Section 27.7). If a
VM exit to a VMM that uses PAE paging does not cause a VMX abort, the PDPTEs are loaded into the processor as
they would be by MOV to CR3, using the value of CR3 being loaded by the VM exit.
specific behavior is documented in Chapter 2, “Model-Specific Registers (MSRs)” in the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 4.
• Bits 63:32 are not all 0.
• An attempt to write bits 127:64 to the MSR indexed by bits 31:0 of the entry would cause a general-protection
exception if executed via WRMSR with CPL = 0.1
If processing fails for any entry, a VMX abort occurs. See Section 27.7.
If any MSR is being loaded in such a way that would architecturally require a TLB flush, the TLBs are updated so
that, after VM exit, the logical processor does not use any translations that were cached before the transition.
1. Note the following about processors that support Intel 64 architecture. If CR0.PG = 1, WRMSR to the IA32_EFER MSR causes a gen-
eral-protection exception if it would modify the LME bit. Since CR0.PG is always 1 in VMX operation, the IA32_EFER MSR should not
be included in the VM-exit MSR-load area for the purpose of modifying the LME bit.
2. A logical processor is in SMX operation if GETSEC[SEXIT] has not been executed since the last execution of GETSEC[SENTER]. A logi-
cal processor is outside SMX operation if GETSEC[SENTER] has not been executed or if GETSEC[SEXIT] was executed after the last
execution of GETSEC[SENTER]. See Chapter 6, “Safer Mode Extensions Reference,” in Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2B.
INIT signals; external interrupts; non-maskable interrupts (NMIs); start-up IPIs (SIPIs); and system-
management interrupts (SMIs).
23. Updates to Chapter 28, Volume 3C
Change bars show changes to Chapter 28 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3C: System Programming Guide, Part 3.
------------------------------------------------------------------------------------------
Changes to this chapter: Added Control-flow Enforcement Technology details.
The architecture for VMX operation includes two features that support address translation: virtual-processor iden-
tifiers (VPIDs) and the extended page-table mechanism (EPT). VPIDs are a mechanism for managing translations
of linear addresses. EPT defines a layer of address translation that augments the translation of linear addresses.
Section 28.1 details the architecture of VPIDs. Section 28.2 provides the details of EPT. Section 28.3 explains how
a logical processor may cache information from the paging structures, how it may use that cached information, and
how software can manage the cached information.
Vol. 3C 28-1
VMX SUPPORT FOR ADDRESS TRANSLATION
The translation from guest-physical addresses to physical addresses is determined by a set of EPT paging struc-
tures. The EPT paging structures are similar to those used to translate linear addresses while the processor is in
IA-32e mode. Section 28.2.2 gives the details of the EPT paging structures.
If CR0.PG = 1, linear addresses are translated through paging structures referenced through control register CR3.
While the “enable EPT” VM-execution control is 1, these are called guest paging structures. There are no guest
paging structures if CR0.PG = 0.1
When the “enable EPT” VM-execution control is 1, the identity of guest-physical addresses depends on the value
of CR0.PG:
• If CR0.PG = 0, each linear address is treated as a guest-physical address.
• If CR0.PG = 1, guest-physical addresses are those derived from the contents of control register CR3 and the
guest paging structures. (This includes the values of the PDPTEs, which logical processors store in internal,
non-architectural registers.) The latter includes (in page-table entries and in other paging-structure entries for
which bit 7—PS—is 1) the addresses to which linear addresses are translated by the guest paging structures.
If CR0.PG = 1, the translation of a linear address to a physical address requires multiple translations of guest-phys-
ical addresses using EPT. Assume, for example, that CR4.PAE = CR4.PSE = 0. The translation of a 32-bit linear
address then operates as follows:
• Bits 31:22 of the linear address select an entry in the guest page directory located at the guest-physical
address in CR3. The guest-physical address of the guest page-directory entry (PDE) is translated through EPT
to determine the guest PDE’s physical address.
• Bits 21:12 of the linear address select an entry in the guest page table located at the guest-physical address in
the guest PDE. The guest-physical address of the guest page-table entry (PTE) is translated through EPT to
determine the guest PTE’s physical address.
• Bits 11:0 of the linear address are the offset in the page frame located at the guest-physical address in the guest
PTE. The guest-physical address determined by this offset is translated through EPT to determine the physical
address to which the original linear address translates.
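The three steps above can be sketched in C under the stated assumptions (CR4.PAE = CR4.PSE = 0, so 32-bit two-level guest paging). The helpers ept_translate() and phys_read32() are hypothetical stand-ins for the full EPT walk and for physical-memory access; here they are stubbed with an identity EPT mapping and a toy in-memory “RAM” so the sketch is self-contained, and present-bit and permission checks are omitted:

```c
#include <stdint.h>
#include <string.h>

static uint8_t ram[0x4000];                 /* toy "physical memory" */

static uint64_t ept_translate(uint64_t gpa) /* stub: identity EPT mapping */
{
    return gpa;
}

static uint32_t phys_read32(uint64_t pa)    /* stub: read from toy RAM */
{
    uint32_t v;
    memcpy(&v, &ram[pa], sizeof v);
    return v;
}

/* Translate a 32-bit guest linear address to a physical address,
 * performing one EPT translation per guest-physical access. */
static uint64_t guest_lin_to_phys(uint32_t cr3, uint32_t lin)
{
    /* Bits 31:22 index the guest page directory at the GPA in CR3. */
    uint64_t pde_gpa = (cr3 & ~0xFFFu) + ((lin >> 22) & 0x3FF) * 4;
    uint32_t pde = phys_read32(ept_translate(pde_gpa));

    /* Bits 21:12 index the guest page table at the GPA in the PDE. */
    uint64_t pte_gpa = (pde & ~0xFFFu) + ((lin >> 12) & 0x3FF) * 4;
    uint32_t pte = phys_read32(ept_translate(pte_gpa));

    /* Bits 11:0 are the offset into the page frame from the PTE. */
    uint64_t final_gpa = (pte & ~0xFFFu) + (lin & 0xFFF);
    return ept_translate(final_gpa);
}
```

A real EPT walk is itself a multi-level lookup (Section 28.2.2), so a hardware translation of one linear address involves several guest-physical translations, as the text notes.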
In addition to translating a guest-physical address to a physical address, EPT specifies the privileges that software
is allowed when accessing the address. Attempts at disallowed accesses are called EPT violations and cause
VM exits. See Section 28.2.3.
A processor uses EPT to translate guest-physical addresses only when those addresses are used to access memory.
This principle implies the following:
• The MOV to CR3 instruction loads CR3 with a guest-physical address. Whether that address is translated
through EPT depends on whether PAE paging is being used.2
— If PAE paging is not being used, the instruction does not use that address to access memory and does not
cause it to be translated through EPT. (If CR0.PG = 1, the address will be translated through EPT on the
next memory access using a linear address.)
— If PAE paging is being used, the instruction loads the four (4) page-directory-pointer-table entries (PDPTEs)
from that address and it does cause the address to be translated through EPT.
• Section 4.4.1 identifies executions of MOV to CR0 and MOV to CR4 that load the PDPTEs from the guest-
physical address in CR3. Such executions cause that address to be translated through EPT.
• The PDPTEs contain guest-physical addresses. The instructions that load the PDPTEs (see above) do not use
those addresses to access memory and do not cause them to be translated through EPT. The address in a
PDPTE will be translated through EPT on the next memory access using a linear address that uses that
PDPTE.
1. “Enable EPT” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-execution controls
is 0, the logical processor operates as if the “enable EPT” VM-execution control were 0. See Section 24.6.2.
1. If the capability MSR IA32_VMX_CR0_FIXED0 reports that CR0.PG must be 1 in VMX operation, CR0.PG can be 0 in VMX non-root
operation only if the “unrestricted guest” VM-execution control and bit 31 of the primary processor-based VM-execution controls are
both 1.
2. A logical processor uses PAE paging if CR0.PG = 1, CR4.PAE = 1 and IA32_EFER.LMA = 0. See Section 4.4 in the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 3A.
Table 28-1. Format of an EPT PML4 Entry (PML4E) that References an EPT Page-Directory-Pointer Table
Bit Position(s) Contents
0 Read access; indicates whether reads are allowed from the 512-GByte region controlled by this entry
1 Write access; indicates whether writes are allowed to the 512-GByte region controlled by this entry
2 If the “mode-based execute control for EPT” VM-execution control is 0, execute access; indicates whether instruction
fetches are allowed from the 512-GByte region controlled by this entry
If that control is 1, execute access for supervisor-mode linear addresses; indicates whether instruction fetches are
allowed from supervisor-mode linear addresses in the 512-GByte region controlled by this entry
8 If bit 6 of EPTP is 1, accessed flag for EPT; indicates whether software has accessed the 512-GByte region
controlled by this entry (see Section 28.2.5). Ignored if bit 6 of EPTP is 0
9 Ignored
10 Execute access for user-mode linear addresses. If the “mode-based execute control for EPT” VM-execution control is
1, indicates whether instruction fetches are allowed from user-mode linear addresses in the 512-GByte region
controlled by this entry. If that control is 0, this bit is ignored.
11 Ignored
(N–1):12 Physical address of 4-KByte aligned EPT page-directory-pointer table referenced by this entry1
63:52 Ignored
1. No processors supporting the Intel 64 architecture support more than 48 physical-address bits. Thus, no such processor can produce a guest-physical address with more than 48 bits. An attempt to use such an address causes a page fault. An attempt to load CR3 with such an address causes a general-protection fault. If PAE paging is being used, an attempt to load CR3 that would load a PDPTE with such an address causes a general-protection fault.
2. Future processors may include support for other EPT page-walk lengths. Software should read the VMX capability MSR
IA32_VMX_EPT_VPID_CAP (see Appendix A.10) to determine what EPT page-walk lengths are supported.
NOTES:
1. N is the physical-address width supported by the processor. Software can determine a processor’s physical-address width by executing CPUID with 80000008H in EAX. The physical-address width is returned in bits 7:0 of EAX.
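As an illustration of the entry format above, the access bits and the address field of an EPT PML4E can be decoded as follows. This is a sketch, not part of the architectural definition; the macro and function names are ours.

```c
#include <stdint.h>

/* Access bits of an EPT paging-structure entry (Table 28-1). */
#define EPT_READ      (1ULL << 0)   /* read access */
#define EPT_WRITE     (1ULL << 1)   /* write access */
#define EPT_EXECUTE   (1ULL << 2)   /* execute access (supervisor-mode if MBEC is 1) */
#define EPT_ACCESSED  (1ULL << 8)   /* accessed flag (used only if EPTP bit 6 is 1) */
#define EPT_EXEC_USER (1ULL << 10)  /* execute access for user-mode linear addresses */

/* Extract the physical address in bits (N-1):12, where N is the
   processor's physical-address width (CPUID.80000008H:EAX[7:0]). */
static inline uint64_t ept_entry_addr(uint64_t entry, unsigned paddr_width)
{
    uint64_t mask = ((1ULL << paddr_width) - 1) & ~0xFFFULL;
    return entry & mask;
}
```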
• A 4-KByte naturally aligned EPT page-directory-pointer table is located at the physical address specified in
bits 51:12 of the EPT PML4E. An EPT page-directory-pointer table comprises 512 64-bit entries (EPT PDPTEs).
An EPT PDPTE is selected using the physical address defined as follows:
— Bits 63:52 are all 0.
— Bits 51:12 are from the EPT PML4E.
— Bits 11:3 are bits 38:30 of the guest-physical address.
— Bits 2:0 are all 0.
Because an EPT PDPTE is identified using bits 47:30 of the guest-physical address, it controls access to a 1-GByte
region of the guest-physical-address space. Use of the EPT PDPTE depends on the value of bit 7 in that entry:1
• If bit 7 of the EPT PDPTE is 1, the EPT PDPTE maps a 1-GByte page. The final physical address is computed as
follows:
— Bits 63:52 are all 0.
— Bits 51:30 are from the EPT PDPTE.
— Bits 29:0 are from the original guest-physical address.
The format of an EPT PDPTE that maps a 1-GByte page is given in Table 28-2.
• If bit 7 of the EPT PDPTE is 0, a 4-KByte naturally aligned EPT page directory is located at the physical address
specified in bits 51:12 of the EPT PDPTE. The format of an EPT PDPTE that references an EPT page directory is
given in Table 28-3.
1. Not all processors allow bit 7 of an EPT PDPTE to be set to 1. Software should read the VMX capability MSR
IA32_VMX_EPT_VPID_CAP (see Appendix A.10) to determine whether this is allowed.
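The address arithmetic above can be sketched in C. This is a software model under the assumption that the entry's address field occupies bits 51:12 (i.e., N ≤ 52); the names are ours.

```c
#include <stdint.h>

#define ADDR_51_12 0x000FFFFFFFFFF000ULL  /* address field, bits 51:12 */

/* Physical address of the EPT PDPTE selected for guest-physical address gpa:
   bits 51:12 from the EPT PML4E, bits 11:3 are gpa bits 38:30, bits 2:0 zero. */
static inline uint64_t ept_pdpte_paddr(uint64_t pml4e, uint64_t gpa)
{
    return (pml4e & ADDR_51_12) | (((gpa >> 30) & 0x1FFULL) << 3);
}

/* Final physical address when the EPT PDPTE maps a 1-GByte page (bit 7 set):
   bits 51:30 from the PDPTE, bits 29:0 from the guest-physical address. */
static inline uint64_t ept_1g_paddr(uint64_t pdpte, uint64_t gpa)
{
    return (pdpte & 0x000FFFFFC0000000ULL) | (gpa & 0x3FFFFFFFULL);
}
```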
Table 28-2. Format of an EPT Page-Directory-Pointer-Table Entry (PDPTE) that Maps a 1-GByte Page
Bit Position(s)    Contents
0 Read access; indicates whether reads are allowed from the 1-GByte page referenced by this entry
1 Write access; indicates whether writes are allowed to the 1-GByte page referenced by this entry
2 If the “mode-based execute control for EPT” VM-execution control is 0, execute access; indicates whether
instruction fetches are allowed from the 1-GByte page controlled by this entry
If that control is 1, execute access for supervisor-mode linear addresses; indicates whether instruction fetches are
allowed from supervisor-mode linear addresses in the 1-GByte page controlled by this entry
5:3 EPT memory type for this 1-GByte page (see Section 28.2.7)
6 Ignore PAT memory type for this 1-GByte page (see Section 28.2.7)
8 If bit 6 of EPTP is 1, accessed flag for EPT; indicates whether software has accessed the 1-GByte page referenced
by this entry (see Section 28.2.5). Ignored if bit 6 of EPTP is 0
9 If bit 6 of EPTP is 1, dirty flag for EPT; indicates whether software has written to the 1-GByte page referenced by
this entry (see Section 28.2.5). Ignored if bit 6 of EPTP is 0
10 Execute access for user-mode linear addresses. If the “mode-based execute control for EPT” VM-execution control is
1, indicates whether instruction fetches are allowed from user-mode linear addresses in the 1-GByte page controlled
by this entry. If that control is 0, this bit is ignored.
11 Ignored
59:52 Ignored
60 Supervisor shadow stack. If bit 7 of EPTP is 1, indicates whether supervisor shadow stack accesses are allowed to
guest-physical addresses in the 1-GByte page mapped by this entry (see Section 28.2.3.2).
Ignored if bit 7 of EPTP is 0
62:61 Ignored
63 Suppress #VE. If the “EPT-violation #VE” VM-execution control is 1, EPT violations caused by accesses to this page are convertible to virtualization exceptions only if this bit is 0 (see Section 25.5.6.1). If the “EPT-violation #VE” VM-execution control is 0, this bit is ignored.
NOTES:
1. N is the physical-address width supported by the logical processor.
Table 28-3. Format of an EPT Page-Directory-Pointer-Table Entry (PDPTE) that References an EPT Page Directory
Bit Position(s)    Contents
0 Read access; indicates whether reads are allowed from the 1-GByte region controlled by this entry
1 Write access; indicates whether writes are allowed to the 1-GByte region controlled by this entry
2 If the “mode-based execute control for EPT” VM-execution control is 0, execute access; indicates whether instruction
fetches are allowed from the 1-GByte region controlled by this entry
If that control is 1, execute access for supervisor-mode linear addresses; indicates whether instruction fetches are
allowed from supervisor-mode linear addresses in the 1-GByte region controlled by this entry
8 If bit 6 of EPTP is 1, accessed flag for EPT; indicates whether software has accessed the 1-GByte region controlled
by this entry (see Section 28.2.5). Ignored if bit 6 of EPTP is 0
9 Ignored
10 Execute access for user-mode linear addresses. If the “mode-based execute control for EPT” VM-execution control is
1, indicates whether instruction fetches are allowed from user-mode linear addresses in the 1-GByte region
controlled by this entry. If that control is 0, this bit is ignored.
11 Ignored
(N–1):12 Physical address of 4-KByte aligned EPT page directory referenced by this entry1
63:52 Ignored
NOTES:
1. N is the physical-address width supported by the logical processor.
An EPT page directory comprises 512 64-bit entries (EPT PDEs). An EPT PDE is selected using the physical address
defined as follows:
— Bits 63:52 are all 0.
— Bits 51:12 are from the EPT PDPTE.
— Bits 11:3 are bits 29:21 of the guest-physical address.
— Bits 2:0 are all 0.
Because an EPT PDE is identified using bits 47:21 of the guest-physical address, it controls access to a 2-MByte
region of the guest-physical-address space. Use of the EPT PDE depends on the value of bit 7 in that entry:
• If bit 7 of the EPT PDE is 1, the EPT PDE maps a 2-MByte page. The final physical address is computed as
follows:
— Bits 63:52 are all 0.
— Bits 51:21 are from the EPT PDE.
— Bits 20:0 are from the original guest-physical address.
The format of an EPT PDE that maps a 2-MByte page is given in Table 28-4.
• If bit 7 of the EPT PDE is 0, a 4-KByte naturally aligned EPT page table is located at the physical address
specified in bits 51:12 of the EPT PDE. The format of an EPT PDE that references an EPT page table is given in
Table 28-5.
An EPT page table comprises 512 64-bit entries (EPT PTEs). An EPT PTE is selected using a physical address defined
as follows:
— Bits 63:52 are all 0.
— Bits 51:12 are from the EPT PDE.
— Bits 11:3 are bits 20:12 of the guest-physical address.
— Bits 2:0 are all 0.
Table 28-4. Format of an EPT Page-Directory Entry (PDE) that Maps a 2-MByte Page
Bit Position(s)    Contents
0 Read access; indicates whether reads are allowed from the 2-MByte page referenced by this entry
1 Write access; indicates whether writes are allowed to the 2-MByte page referenced by this entry
2 If the “mode-based execute control for EPT” VM-execution control is 0, execute access; indicates whether instruction
fetches are allowed from the 2-MByte page controlled by this entry
If that control is 1, execute access for supervisor-mode linear addresses; indicates whether instruction fetches are
allowed from supervisor-mode linear addresses in the 2-MByte page controlled by this entry
5:3 EPT memory type for this 2-MByte page (see Section 28.2.7)
6 Ignore PAT memory type for this 2-MByte page (see Section 28.2.7)
8 If bit 6 of EPTP is 1, accessed flag for EPT; indicates whether software has accessed the 2-MByte page referenced
by this entry (see Section 28.2.5). Ignored if bit 6 of EPTP is 0
9 If bit 6 of EPTP is 1, dirty flag for EPT; indicates whether software has written to the 2-MByte page referenced by
this entry (see Section 28.2.5). Ignored if bit 6 of EPTP is 0
10 Execute access for user-mode linear addresses. If the “mode-based execute control for EPT” VM-execution control is
1, indicates whether instruction fetches are allowed from user-mode linear addresses in the 2-MByte page controlled
by this entry. If that control is 0, this bit is ignored.
11 Ignored
59:52 Ignored
60 Supervisor shadow stack. If bit 7 of EPTP is 1, indicates whether supervisor shadow stack accesses are allowed to
guest-physical addresses in the 2-MByte page mapped by this entry (see Section 28.2.3.2).
Ignored if bit 7 of EPTP is 0
62:61 Ignored
63 Suppress #VE. If the “EPT-violation #VE” VM-execution control is 1, EPT violations caused by accesses to this page are convertible to virtualization exceptions only if this bit is 0 (see Section 25.5.6.1). If the “EPT-violation #VE” VM-execution control is 0, this bit is ignored.
NOTES:
1. N is the physical-address width supported by the logical processor.
Table 28-5. Format of an EPT Page-Directory Entry (PDE) that References an EPT Page Table
Bit Position(s)    Contents
0 Read access; indicates whether reads are allowed from the 2-MByte region controlled by this entry
1 Write access; indicates whether writes are allowed to the 2-MByte region controlled by this entry
2 If the “mode-based execute control for EPT” VM-execution control is 0, execute access; indicates whether instruction
fetches are allowed from the 2-MByte region controlled by this entry
If that control is 1, execute access for supervisor-mode linear addresses; indicates whether instruction fetches are
allowed from supervisor-mode linear addresses in the 2-MByte region controlled by this entry
8 If bit 6 of EPTP is 1, accessed flag for EPT; indicates whether software has accessed the 2-MByte region controlled
by this entry (see Section 28.2.5). Ignored if bit 6 of EPTP is 0
9 Ignored
10 Execute access for user-mode linear addresses. If the “mode-based execute control for EPT” VM-execution control is
1, indicates whether instruction fetches are allowed from user-mode linear addresses in the 2-MByte region
controlled by this entry. If that control is 0, this bit is ignored.
11 Ignored
(N–1):12 Physical address of 4-KByte aligned EPT page table referenced by this entry1
63:52 Ignored
NOTES:
1. N is the physical-address width supported by the logical processor.
NOTE
If the “mode-based execute control for EPT” VM-execution control is 1, an EPT paging-structure
entry is present if any of bits 2:0 or bit 10 is 1. If bits 2:0 are all 0 but bit 10 is 1, the entry is used
normally to reference another EPT paging-structure entry or to produce a physical address.
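The presence rule in the NOTE above can be sketched as a small predicate. This is a simplified model; the names are ours.

```c
#include <stdint.h>
#include <stdbool.h>

/* An EPT paging-structure entry is present if any of bits 2:0 is 1.
   If the "mode-based execute control for EPT" VM-execution control is 1,
   bit 10 (execute access for user-mode linear addresses) also makes the
   entry present. */
static inline bool ept_entry_present(uint64_t entry, bool mode_based_exec)
{
    uint64_t mask = 0x7ULL | (mode_based_exec ? (1ULL << 10) : 0);
    return (entry & mask) != 0;
}
```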
The discussion above describes how the EPT paging structures reference each other and how the logical processor
traverses those structures when translating a guest-physical address. It does not cover all details of the translation
process. Additional details are provided as follows:
• Situations in which the translation process may lead to VM exits (sometimes before the process completes) are
described in Section 28.2.3.
• Interactions between the EPT translation mechanism and memory typing are described in Section 28.2.7.
Figure 28-1 gives a summary of the formats of the EPTP and the EPT paging-structure entries. For the EPT paging
structure entries, it identifies separately the format of entries that map pages, those that reference other EPT
paging structures, and those that do neither because they are not present; bits 2:0 and bit 7 are highlighted
because they determine how a paging-structure entry is used. (Figure 28-1 does not comprehend the fact that, if
the “mode-based execute control for EPT” VM-execution control is 1, an entry is present if any of bits 2:0 or bit 10
is 1.)
1. If the logical processor is using PAE paging—because CR0.PG = CR4.PAE = 1 and IA32_EFER.LMA = 0—the MOV to CR3 instruction
loads the PDPTEs from memory using the guest-physical address being loaded into CR3. In this case, therefore, the MOV to CR3
instruction may cause an EPT misconfiguration, an EPT violation, or a page-modification log-full event.
2. If the “mode-based execute control for EPT” VM-execution control is 1, setting bit 2 indicates that instruction fetches are allowed
from supervisor-mode linear addresses.
3. Software can determine a processor’s physical-address width by executing CPUID with 80000008H in EAX. The physical-address
width is returned in bits 7:0 of EAX.
Table 28-6. Format of an EPT Page-Table Entry that Maps a 4-KByte Page
Bit Position(s)    Contents
0 Read access; indicates whether reads are allowed from the 4-KByte page referenced by this entry
1 Write access; indicates whether writes are allowed to the 4-KByte page referenced by this entry
2 If the “mode-based execute control for EPT” VM-execution control is 0, execute access; indicates whether
instruction fetches are allowed from the 4-KByte page controlled by this entry
If that control is 1, execute access for supervisor-mode linear addresses; indicates whether instruction fetches are
allowed from supervisor-mode linear addresses in the 4-KByte page controlled by this entry
5:3 EPT memory type for this 4-KByte page (see Section 28.2.7)
6 Ignore PAT memory type for this 4-KByte page (see Section 28.2.7)
7 Ignored
8 If bit 6 of EPTP is 1, accessed flag for EPT; indicates whether software has accessed the 4-KByte page referenced
by this entry (see Section 28.2.5). Ignored if bit 6 of EPTP is 0
9 If bit 6 of EPTP is 1, dirty flag for EPT; indicates whether software has written to the 4-KByte page referenced by
this entry (see Section 28.2.5). Ignored if bit 6 of EPTP is 0
10 Execute access for user-mode linear addresses. If the “mode-based execute control for EPT” VM-execution control is
1, indicates whether instruction fetches are allowed from user-mode linear addresses in the 4-KByte page controlled
by this entry. If that control is 0, this bit is ignored.
11 Ignored
59:52 Ignored
60 Supervisor shadow stack. If bit 7 of EPTP is 1, indicates whether supervisor shadow stack accesses are allowed to
guest-physical addresses in the 4-KByte page mapped by this entry (see Section 28.2.3.2).
Ignored if bit 7 of EPTP is 0
61 Sub-page write permissions. If the “sub-page write permissions for EPT” VM-execution control is 1, writes to individual 128-byte regions of the 4-KByte page referenced by this entry may be allowed even if the page would normally not be writable (see Section 28.2.4). If the “sub-page write permissions for EPT” VM-execution control is 0, this bit is ignored.
62 Ignored
63 Suppress #VE. If the “EPT-violation #VE” VM-execution control is 1, EPT violations caused by accesses to this page are convertible to virtualization exceptions only if this bit is 0 (see Section 25.5.6.1). If the “EPT-violation #VE” VM-execution control is 0, this bit is ignored.
NOTES:
1. N is the physical-address width supported by the logical processor.
• The access is a data read and, for any byte to be read, bit 0 (read access) was clear in any of the EPT paging-
structure entries used to translate the guest-physical address of the byte. Reads by the logical processor of
guest paging structures to translate a linear address are considered to be data reads.
[Figure 28-1. Formats of EPTP and EPT Paging-Structure Entries. For the EPTP and for each entry type (EPT PML4E, PDPTE, PDE, and PTE), the figure shows the bit layout of entries that map pages, entries that reference other EPT paging structures, and entries that are not present; see Tables 28-1 through 28-6 for the field definitions.]
• The access is a data write and, for any byte to be written, bit 1 (write access) was clear in any of the EPT
paging-structure entries used to translate the guest-physical address of the byte. Writes by the logical
processor to guest paging structures to update accessed and dirty flags are considered to be data writes.
If bit 6 of the EPT pointer (EPTP) is 1 (enabling accessed and dirty flags for EPT), processor accesses to guest
paging-structure entries are treated as writes with regard to EPT violations. Thus, if bit 1 is clear in any of the
EPT paging-structure entries used to translate the guest-physical address of a guest paging-structure entry, an
attempt to use that entry to translate a linear address causes an EPT violation.
(This does not apply to loads of the PDPTE registers by the MOV to CR instruction for PAE paging; see Section
4.4.1. Those loads of guest PDPTEs are treated as reads and do not cause EPT violations due to a guest-
physical address not being writable.)
If the “sub-page write permissions for EPT” VM-execution control is 1, data writes to a guest-physical address
that would cause an EPT violation (as indicated above) do not do so in certain situations. If the guest-physical
address is mapped using a 4-KByte page and bit 61 (sub-page write permissions) of the EPT PTE used to map
the page is 1, writes to certain 128-byte sub-pages may be allowed. See Section 28.2.4 for details.
• The access is an instruction fetch and the EPT paging structures prevent execute access to any of the bytes
being fetched. Whether this occurs depends upon the setting of the “mode-based execute control for EPT” VM-
execution control:
— If the control is 0, an instruction fetch from a byte is prevented if bit 2 (execute access) was clear in any of
the EPT paging-structure entries used to translate the guest-physical address of the byte.
— If the control is 1, an instruction fetch from a byte is prevented in either of the following cases:
• Paging maps the linear address of the byte as a supervisor-mode address and bit 2 (execute access for
supervisor-mode linear addresses) was clear in any of the EPT paging-structure entries used to
translate the guest-physical address of the byte.
Paging maps a linear address as a supervisor-mode address if the U/S flag (bit 2) is 0 in at least one of
the paging-structure entries controlling the translation of the linear address.
• Paging maps the linear address of the byte as a user-mode address and bit 10 (execute access for user-
mode linear addresses) was clear in any of the EPT paging-structure entries used to translate the guest-
physical address of the byte.
Paging maps a linear address as a user-mode address if the U/S flag is 1 in all of the paging-structure
entries controlling the translation of the linear address. If paging is disabled (CR0.PG = 0), every linear
address is a user-mode address.
• If supervisor shadow-stack control is enabled (by setting bit 7 of EPTP), the access is a supervisor shadow-
stack access, and the EPT paging-structure entries used to translate the guest-physical address of the access
disallow supervisor shadow-stack accesses. Such an access is disallowed if any of the following hold:
— Bit 0 (read access) is clear in any EPT paging-structure entry used to translate the guest-physical address
of the access.
— Bit 1 (write access) is clear in any EPT paging-structure entry that references an EPT paging structure in the
translation of the guest-physical address. (Clearing bit 1 in the EPT paging-structure entry that maps the
page of the guest-physical address does not disallow shadow-stack reads and writes.)
— Bit 60 (supervisor-shadow stack access) is clear in the EPT paging-structure entry that maps the page of
the guest-physical address.
Supervisor shadow-stack control and the supervisor-shadow stack access bits in EPT paging-structure entries
do not affect other accesses (including user shadow-stack accesses).
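The execute-access and supervisor shadow-stack conditions above can be sketched as predicates over the entries used in a translation. This is a simplified model: the names, and the representation of a translation as the bitwise AND of its entries (so a bit is set only if it was set at every level), are ours.

```c
#include <stdint.h>
#include <stdbool.h>

/* Instruction fetch from a byte is prevented if the relevant execute bit
   (bit 2, or bit 10 for user-mode linear addresses when the "mode-based
   execute control for EPT" VM-execution control is 1) was clear in any
   entry used to translate the byte's guest-physical address. */
static bool ept_fetch_prevented(uint64_t and_of_entries,
                                bool mode_based_exec, bool user_mode_addr)
{
    if (mode_based_exec && user_mode_addr)
        return (and_of_entries & (1ULL << 10)) == 0;  /* bit 10 clear somewhere */
    return (and_of_entries & (1ULL << 2)) == 0;       /* bit 2 clear somewhere */
}

/* Supervisor shadow-stack access check (EPTP bit 7 set):
   and_of_all     - AND of every entry used in the translation,
   and_of_nonleaf - AND of the entries that reference other EPT structures,
   leaf           - the entry that maps the page. */
static bool sss_access_disallowed(uint64_t and_of_all,
                                  uint64_t and_of_nonleaf, uint64_t leaf)
{
    return (and_of_all & 1) == 0            /* read clear at some level */
        || (and_of_nonleaf & 2) == 0        /* write clear in a non-leaf entry */
        || (leaf & (1ULL << 60)) == 0;      /* bit 60 clear in mapping entry */
}
```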
a. If the entry is not present (see Section 28.2.2), an EPT violation occurs.
b. If the entry is present but its contents are not configured properly (see Section 28.2.3.1), an EPT misconfiguration occurs.
c. If the entry is present and its contents are configured properly, operation depends on whether the entry
references another EPT paging structure (whether it is an EPT PDE with bit 7 set to 1 or an EPT PTE):
i) If the entry does reference another EPT paging structure, an entry from that structure is accessed;
step 1 is executed for that other entry.
ii) Otherwise, the entry is used to produce the ultimate physical address (the translation of the original
guest-physical address); step 2 is executed.
2. Once the ultimate physical address is determined, the privileges determined by the EPT paging-structure
entries are evaluated:
a. If the access to the guest-physical address is not allowed by these privileges (see Section 28.2.3.2), an EPT
violation occurs.
b. If the access to the guest-physical address is allowed by these privileges, memory is accessed using the
ultimate physical address.
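Steps 1 and 2 above can be sketched as a software model of the walk. Here mem simulates physical memory as an array of 8-byte entries, the names are ours, and access-rights checks and EPT-misconfiguration detection are omitted for brevity.

```c
#include <stdint.h>

#define EPT_RWX        0x7ULL
#define EPT_PAGE_BIT   (1ULL << 7)              /* maps a page at this level */
#define EPT_ADDR_MASK  0x000FFFFFFFFFF000ULL   /* address field, bits 51:12 */

/* Walk a 4-level EPT hierarchy held in simulated physical memory
   (mem is indexed in 8-byte units; addresses are byte offsets).
   Returns 0 with the translation in *pa, or -1 to model an EPT
   violation on a not-present entry. */
static int ept_translate(const uint64_t *mem, uint64_t eptp,
                         uint64_t gpa, uint64_t *pa)
{
    uint64_t table = eptp & EPT_ADDR_MASK;      /* EPT PML4 table */
    for (int level = 3; level >= 0; level--) {
        unsigned index = (gpa >> (12 + 9 * level)) & 0x1FF;
        uint64_t entry = mem[table / 8 + index];
        if ((entry & EPT_RWX) == 0)
            return -1;                          /* step a: EPT violation */
        /* step c: a PDPTE or PDE with bit 7 set, or a PTE, produces
           the ultimate physical address */
        if (level == 0 || (level <= 2 && (entry & EPT_PAGE_BIT))) {
            uint64_t offset_mask = (1ULL << (12 + 9 * level)) - 1;
            *pa = (entry & EPT_ADDR_MASK & ~offset_mask) | (gpa & offset_mask);
            return 0;
        }
        table = entry & EPT_ADDR_MASK;          /* step c(i): next table */
    }
    return -1;                                  /* not reached */
}
```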
If CR0.PG = 1, the translation of a linear address is also an iterative process, with the processor first accessing an
entry in the guest paging structure referenced by the guest-physical address in CR3 (or, if PAE paging is in use, the
guest-physical address in the appropriate PDPTE register), then accessing an entry in another guest paging structure referenced by the guest-physical address in the first guest paging-structure entry, etc. Each guest-physical
address is itself translated using EPT and may cause an EPT-induced VM exit. The following items detail how page
faults and EPT-induced VM exits are recognized during this iterative process:
1. An attempt is made to access a guest paging-structure entry with a guest-physical address (initially, the
address in CR3 or PDPTE register).
a. If the access fails because of an EPT misconfiguration or an EPT violation (see above), an EPT-induced
VM exit occurs.
b. If the access does not cause an EPT-induced VM exit, bit 0 (the present flag) of the entry is consulted:
i) If the present flag is 0 or any reserved bit is set, a page fault occurs.
ii) If the present flag is 1 and no reserved bit is set, operation depends on whether the entry references
another guest paging structure (whether it is a guest PDE with PS = 1 or a guest PTE):
• If the entry does reference another guest paging structure, an entry from that structure is
accessed; step 1 is executed for that other entry.
• Otherwise, the entry is used to produce the ultimate guest-physical address (the translation of the
original linear address); step 2 is executed.
2. Once the ultimate guest-physical address is determined, the privileges determined by the guest paging-
structure entries are evaluated:
a. If the access to the linear address is not allowed by these privileges (e.g., it was a write to a read-only
page), a page fault occurs.
b. If the access to the linear address is allowed by these privileges, an attempt is made to access memory at
the ultimate guest-physical address:
i) If the access fails because of an EPT misconfiguration or an EPT violation (see above), an EPT-induced
VM exit occurs.
ii) If the access does not cause an EPT-induced VM exit, memory is accessed using the ultimate physical
address (the translation, using EPT, of the ultimate guest-physical address).
If CR0.PG = 0, a linear address is treated as a guest-physical address and is translated using EPT (see above). This process, if it completes without an EPT violation or EPT misconfiguration, produces a physical address and determines the privileges allowed by the EPT paging-structure entries. If these privileges do not allow the access to the physical address (see Section 28.2.3.2), an EPT violation occurs. Otherwise, memory is accessed using the physical address.
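The CR0.PG = 1 process above can be modeled by nesting an EPT walk inside a guest 4-level paging walk: every guest-physical access, including the accesses to the guest paging structures themselves, goes through EPT first. This is a simplified sketch (names ours); guest permission checks, reserved-bit checks, and accessed/dirty updates are omitted.

```c
#include <stdint.h>

#define RWX   0x7ULL                       /* EPT read/write/execute bits */
#define PS    (1ULL << 7)                  /* large page / guest PS bit */
#define ADDR  0x000FFFFFFFFFF000ULL        /* address field, bits 51:12 */

/* Minimal EPT walk over simulated physical memory (mem is indexed in
   8-byte units; addresses are byte offsets). */
static int ept_walk(const uint64_t *mem, uint64_t eptp, uint64_t gpa,
                    uint64_t *pa)
{
    uint64_t table = eptp & ADDR;
    for (int lvl = 3; lvl >= 0; lvl--) {
        uint64_t e = mem[table / 8 + ((gpa >> (12 + 9 * lvl)) & 0x1FF)];
        if ((e & RWX) == 0)
            return -1;                     /* EPT violation */
        if (lvl == 0 || (lvl <= 2 && (e & PS))) {
            uint64_t m = (1ULL << (12 + 9 * lvl)) - 1;
            *pa = (e & ADDR & ~m) | (gpa & m);
            return 0;
        }
        table = e & ADDR;
    }
    return -1;
}

/* CR0.PG = 1: walk the guest's 4-level paging structures, translating
   every guest-physical access through EPT first.  Returns 0 on success,
   -1 to model an EPT-induced VM exit, -2 to model a guest page fault. */
static int nested_translate(const uint64_t *mem, uint64_t eptp,
                            uint64_t guest_cr3, uint64_t lin, uint64_t *pa)
{
    uint64_t gpa = guest_cr3 & ADDR;       /* gpa of guest root structure */
    for (int lvl = 3; lvl >= 0; lvl--) {
        uint64_t entry_pa;
        if (ept_walk(mem, eptp,
                     gpa + 8ULL * ((lin >> (12 + 9 * lvl)) & 0x1FF),
                     &entry_pa) != 0)
            return -1;                     /* step 1a: EPT-induced VM exit */
        uint64_t e = mem[entry_pa / 8];
        if (!(e & 1))
            return -2;                     /* step 1b(i): page fault */
        if (lvl == 0 || (e & PS)) {        /* leaf: ultimate gpa found */
            uint64_t m = (1ULL << (12 + 9 * lvl)) - 1;
            gpa = (e & ADDR & ~m) | (lin & m);
            return ept_walk(mem, eptp, gpa, pa);  /* step 2b */
        }
        gpa = e & ADDR;
    }
    return -1;
}
```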
28.2.4.1 Write Accesses That Are Eligible for Sub-Page Write Permissions
A guest-physical address is eligible for sub-page write permissions if writes to it would be disallowed following Section 28.2.3.2: bit 1 (write access) is clear in any of the EPT paging-structure entries used to translate the guest-physical address. Guest-physical addresses to which writes would be disallowed for other reasons (e.g., the translation encounters an EPT paging-structure entry that is not present) are not eligible for sub-page write permissions.
In addition, a guest-physical address is eligible for sub-page write permissions only if it is mapped using a 4-KByte
page and bit 61 (sub-page write permissions) of the EPT PTE used to map the page is 1. (Guest-physical addresses
mapped with larger pages are not eligible for sub-page write permissions.)
For some memory accesses, the processor ignores bit 61 in an EPT PTE used to map a 4-KByte page and does not
apply sub-page write permissions to the access. (In such a case, the access causes an EPT violation when indicated
by the conditions given in Section 28.2.3.2.) Sub-page write permissions never apply to the following accesses:
• A write access performed within a transactional region.
• A write access by an enclave to an address within the enclave's ELRANGE. (Sub-page write permissions may
apply to write accesses by an enclave to addresses outside its ELRANGE.)
• A write access to the enclave page cache (EPC) by an Intel SGX instruction.
• A write access to a guest paging structure to update an accessed or dirty flag.
• Processor accesses to guest paging-structure entries when accessed and dirty flags for EPT are enabled (such
accesses are treated as writes with regard to EPT violations).
There are additional accesses to which sub-page write permissions might not be applied (behavior is model-
specific). The following items enumerate examples:
• A write access that crosses two 4-KByte pages. In this case, sub-page permissions may be applied to neither
or to only one of the pages. (There is no write to either page unless the write is allowed to both pages.)
• A write access by an instruction that performs multiple write accesses (sub-page write permissions are
intended principally for basic instructions such as AND, MOV, OR, TEST, XCHG, and XOR).
If a guest-physical address is eligible for sub-page write permissions, the processor determines whether to allow a
write to the address using the process described in Section 28.2.4.2.
If a guest-physical address is eligible for sub-page write permissions and that address translates to an address on
the APIC-access page (see Section 29.4), the processor may treat a write access to the address as if the “virtualize
APIC accesses” VM-execution control were 0. For that reason, it is recommended that software not configure any
guest-physical address that translates to an address on the APIC-access page to be eligible for sub-page write
permissions.
For each guest-physical address eligible for sub-page write permissions, there is a 64-bit sub-page permission vector (SPP vector). All addresses on a 4-KByte page use the same SPP vector. If an address’s sub-page number (bits 11:7 of the address) is S, writes to the address are allowed if and only if bit 2S of the SPP vector is set to 1. (The bits at odd positions in an SPP vector are not used and must be zero.)
Each page’s SPP vector is located in memory. For a write to a guest-physical address eligible for sub-page write
permissions, the processor uses the following process to locate the address’s SPP vector:
1. The SPPTP (sub-page-permission-table pointer) VM-execution control field contains the physical address of the
4-KByte root SPP table (SPPL4 table). Bits 47:39 of the guest-physical address identify a 64-bit entry in that
table, called an SPPL4E.
2. A 4-KByte SPPL3 table is located at the physical address in the selected SPPL4E. Bits 38:30 of the guest-
physical address identify a 64-bit entry in that table, called an SPPL3E.
3. A 4-KByte SPPL2 table is located at the physical address in the selected SPPL3E. Bits 29:21 of the guest-
physical address identify a 64-bit entry in that table, called an SPPL2E.
4. A 4-KByte SPP-vector table (SPPL1 table) is located at the physical address in the selected SPPL2E. Bits 20:12
of the guest-physical address identify the 64-bit SPP vector for the address. As noted earlier, bit 2S of the sub-
page permission vector determines whether the address may be written, where S is the value of address
bits 11:7.
(The memory type used to access these tables is reported in bits 53:50 of the IA32_VMX_BASIC MSR. See
Appendix A.1.)
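The four-step lookup above can be sketched as follows. This is a software model (mem simulates physical memory in 8-byte units); the names and return convention are ours, and the reserved-bit checks that produce SPP misconfigurations are omitted.

```c
#include <stdint.h>

#define SPP_ADDR  0x000FFFFFFFFFF000ULL   /* address field, bits 51:12 */
#define SPP_VALID 1ULL                    /* bit 0: entry valid */

/* Locate the SPP vector for gpa and test whether a write is allowed.
   Returns 1 if allowed, 0 if not, -1 to model an SPP miss. */
static int spp_write_allowed(const uint64_t *mem, uint64_t spptp, uint64_t gpa)
{
    uint64_t table = spptp & SPP_ADDR;    /* root SPPL4 table */
    int shifts[3] = { 39, 30, 21 };       /* SPPL4E, SPPL3E, SPPL2E indices */
    for (int i = 0; i < 3; i++) {
        uint64_t e = mem[table / 8 + ((gpa >> shifts[i]) & 0x1FF)];
        if (!(e & SPP_VALID))
            return -1;                    /* SPP miss */
        table = e & SPP_ADDR;
    }
    uint64_t vector = mem[table / 8 + ((gpa >> 12) & 0x1FF)];
    unsigned s = (gpa >> 7) & 0x1F;       /* sub-page number, bits 11:7 */
    return (int)((vector >> (2 * s)) & 1);/* bit 2S of the SPP vector */
}
```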
A write access to multiple 128-byte sub-pages on a single 4-KByte page is allowed only if, for each of those sub-
pages, the indicated bit in the page’s SPP vector is 1. The following items apply to cases in which an access writes to two
4-KByte pages:
• If a write to either page would be disallowed according to Section 28.2.3.2, the access might be disallowed
even if the guest-physical address of that page is eligible for sub-page write permissions. (This behavior is
model-specific.)
• The access is allowed only if, for each page, either (1) a write to the page would be allowed following Section
28.2.3.2; or (2) both (a) the guest-physical address of that page is eligible for sub-page write permissions; and
(b) the page’s sub-page vector allows the write (as described above).
Bit 0 of each entry (SPPL4E, SPPL3E, or SPPL2E) is the entry’s valid bit. If the process above accesses an entry in
which this bit is 0, the process stops and the logical processor incurs an SPP miss.
In each entry (SPPL4E, SPPL3E, or SPPL2E), bits 11:1 are reserved, as are bits 63:N, where N is the processor’s physical-address width. If the process above accesses an entry in which the valid bit is 1 and in which some reserved bit is set, the process stops and the logical processor incurs an SPP misconfiguration. Bits in an SPP vector at odd positions are also reserved; an SPP misconfiguration also occurs if any of those bits are set in the final SPP vector.
SPP misses and SPP misconfigurations are called SPP-related events and cause VM exits.
Whenever the processor uses an EPT paging-structure entry as part of guest-physical-address translation, it sets
the accessed flag in that entry (if it is not already set).
Whenever there is a write to a guest-physical address, the processor sets the dirty flag (if it is not already set) in
the EPT paging-structure entry that identifies the final physical address for the guest-physical address (either an
EPT PTE or an EPT paging-structure entry in which bit 7 is 1).
When accessed and dirty flags for EPT are enabled, processor accesses to guest paging-structure entries are
treated as writes (see Section 28.2.3.2). Thus, such an access will cause the processor to set the dirty flag in the
EPT paging-structure entry that identifies the final physical address of the guest paging-structure entry.
(This does not apply to loads of the PDPTE registers for PAE paging by the MOV to CR instruction; see Section 4.4.1.
Those loads of guest PDPTEs are treated as reads and do not cause the processor to set the dirty flag in any EPT
paging-structure entry.)
These flags are “sticky,” meaning that, once set, the processor does not clear them; only software can clear them.
A processor may cache information from the EPT paging-structure entries in TLBs and paging-structure caches
(see Section 28.3). This fact implies that, if software changes an accessed flag or a dirty flag from 1 to 0, the
processor might not set the corresponding bit in memory on a subsequent access using an affected guest-physical
address.
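The sticky behavior matters to software that harvests dirty bits. The following C sketch (hypothetical; it assumes a flat array of final EPT PTEs with the accessed flag in bit 8 and the dirty flag in bit 9) shows a dirty-page pass such as a VMM might use for live migration. Because of the caching behavior just described, after clearing the flags the VMM must invalidate cached EPT information (INVEPT) before relying on the cleared bits:

```c
#include <stdint.h>
#include <stddef.h>

#define EPT_ACCESSED (1ull << 8)   /* set on any use of the entry for translation */
#define EPT_DIRTY    (1ull << 9)   /* set in the final entry on a write */

/* Hypothetical dirty-page harvesting pass. Returns the number of dirty
 * pages found and records their indices in dirty_idx. */
size_t harvest_dirty(uint64_t *pte, size_t n, size_t *dirty_idx)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (pte[i] & EPT_DIRTY) {
            dirty_idx[count++] = i;                 /* record the dirty page */
            pte[i] &= ~(EPT_DIRTY | EPT_ACCESSED);  /* clear for the next round */
        }
    }
    return count;   /* caller must execute INVEPT before trusting the cleared bits */
}
```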
1. If the capability MSR IA32_VMX_CR0_FIXED0 reports that CR0.PG must be 1 in VMX operation, CR0.PG can be 0 in VMX non-root
operation only if the “unrestricted guest” VM-execution control and bit 31 of the primary processor-based VM-execution controls are
both 1.
2. Table 11-11 in Section 11.12.3, “Selecting a Memory Type from the PAT” illustrates how the PAT memory type is selected based on
the values of the PAT, PCD, and PWT bits in a page-table entry (or page-directory entry with PS = 1). For accesses to a guest paging-
structure entry X, the PAT memory type is selected from the table by using a value of 0 for the PAT bit with the values of PCD and
PWT from the paging-structure entry Y that references X (or from CR3 if X is in the root paging structure). With PAE paging, the PAT
memory type for accesses to the PDPTEs is WB.
and from that address space (to the physical-address space). Both features control the ways in which a logical
processor may create and use information cached from the paging structures.
Section 28.3.1 describes the different kinds of information that may be cached. Section 28.3.2 specifies when such
information may be cached and how it may be used. Section 28.3.3 details how software can invalidate cached
information.
1. Earlier versions of this manual used the term “VPID-tagged” to identify linear mappings.
2. Earlier versions of this manual used the term “EPTP-tagged” to identify guest-physical mappings.
3. Earlier versions of this manual used the term “dual-tagged” to identify combined mappings.
• The following items describe the creation of mappings while EPT is not in use (including execution outside VMX
non-root operation):
— Linear mappings may be created. They are derived from the paging structures referenced (directly or
indirectly) by the current value of CR3 and are associated with the current VPID and the current PCID.
— No linear mappings are created with information derived from paging-structure entries that are not present
(bit 0 is 0) or that set reserved bits. For example, if a PTE is not present, no linear mappings are created for
any linear page number whose translation would use that PTE.
— No guest-physical or combined mappings are created while EPT is not in use.
• The following items describe the creation of mappings while EPT is in use:
— Guest-physical mappings may be created. They are derived from the EPT paging structures referenced
(directly or indirectly) by bits 51:12 of the current EPTP. These 40 bits contain the address of the EPT PML4
table (the notation EP4TA refers to those 40 bits). Newly created guest-physical mappings are associated
with the current EP4TA.
— Combined mappings may be created. They are derived from the EPT paging structures referenced (directly
or indirectly) by the current EP4TA. If CR0.PG = 1, they are also derived from the paging structures
referenced (directly or indirectly) by the current value of CR3. They are associated with the current VPID,
the current PCID, and the current EP4TA.1 No combined paging-structure-cache entries are created if
CR0.PG = 0.2
— No guest-physical mappings or combined mappings are created with information derived from EPT paging-
structure entries that are not present (see Section 28.2.2) or that are misconfigured (see Section
28.2.3.1).
— No combined mappings are created with information derived from guest paging-structure entries that are
not present or that set reserved bits.
— No linear mappings are created while EPT is in use.
The following items detail the use of the various mappings:
• If EPT is not in use (e.g., when outside VMX non-root operation), a logical processor may use cached mappings
as follows:
— For accesses using linear addresses, it may use linear mappings associated with the current VPID and the
current PCID. It may also use global TLB entries (linear mappings) associated with the current VPID and
any PCID.
— No guest-physical or combined mappings are used while EPT is not in use.
• If EPT is in use, a logical processor may use cached mappings as follows:
— For accesses using linear addresses, it may use combined mappings associated with the current VPID, the
current PCID, and the current EP4TA. It may also use global TLB entries (combined mappings) associated
with the current VPID, the current EP4TA, and any PCID.
— For accesses using guest-physical addresses, it may use guest-physical mappings associated with the
current EP4TA.
— No linear mappings are used while EPT is in use.
4. This section associates cached information with the current VPID and PCID. If PCIDs are not supported or are not being used (e.g.,
because CR4.PCIDE = 0), all the information is implicitly associated with PCID 000H; see Section 4.10.1, “Process-Context Identifiers
(PCIDs),” in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
1. At any given time, a logical processor may be caching combined mappings for a VPID and a PCID that are associated with different
EP4TAs. Similarly, it may be caching combined mappings for an EP4TA that are associated with different VPIDs and PCIDs.
2. If the capability MSR IA32_VMX_CR0_FIXED0 reports that CR0.PG must be 1 in VMX operation, CR0.PG can be 0 in VMX non-root
operation only if the “unrestricted guest” VM-execution control and bit 31 of the primary processor-based VM-execution controls are
both 1.
1. See Section 4.10.4, “Invalidation of TLBs and Paging-Structure Caches,” in the Intel® 64 and IA-32 Architectures Software Devel-
oper’s Manual, Volume 3A for an enumeration of operations that architecturally invalidate entries in the TLBs and paging-structure
caches independent of VMX operation.
2. While no linear mappings are created while EPT is in use, a logical processor may retain, while EPT is in use, linear mappings (for the
same VPID as the current one) that were created earlier, when EPT was not in use.
3. While no combined mappings are created while EPT is not in use, a logical processor may retain, while EPT is not in use, combined
mappings (for the same VPID as the current one) that were created earlier, when EPT was in use.
• Execution of the INVEPT instruction invalidates guest-physical mappings and combined mappings. Invalidation
is based on instruction operands, called the INVEPT type and the INVEPT descriptor. Two INVEPT types are
currently defined:
— Single-context. If the INVEPT type is 1, the logical processor invalidates all guest-physical mappings and
combined mappings associated with the EP4TA specified in the INVEPT descriptor. Combined mappings for
that EP4TA are invalidated for all VPIDs and all PCIDs. (The instruction may invalidate mappings associated
with other EP4TAs.)
— All-context. If the INVEPT type is 2, the logical processor invalidates guest-physical mappings and
combined mappings associated with all EP4TAs (and, for combined mappings, for all VPIDs and PCIDs).
See Chapter 30 for details of the INVEPT instruction. See Section 28.3.3.4 for guidelines regarding use of this
instruction.
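The minimum invalidation scope of the two INVEPT types can be modeled in C. This sketch (hypothetical names; a real processor may always invalidate more than this minimum) records a cached mapping and asks whether a given INVEPT operation is architecturally required to invalidate it:

```c
#include <stdint.h>
#include <stdbool.h>

enum mapping_kind { LINEAR, GUEST_PHYSICAL, COMBINED };

struct mapping {
    enum mapping_kind kind;
    uint16_t vpid;
    uint16_t pcid;
    uint64_t ep4ta;    /* bits 51:12 of the EPTP in effect when it was created */
};

/* Architectural minimum for INVEPT (type 1 = single-context, type 2 =
 * all-context). INVEPT never touches linear mappings; single-context
 * invalidates mappings for the descriptor's EP4TA across all VPIDs/PCIDs. */
bool invept_must_invalidate(int type, uint64_t desc_ep4ta, const struct mapping *m)
{
    if (m->kind == LINEAR)
        return false;
    if (type == 2)
        return true;                          /* all EP4TAs, VPIDs, and PCIDs */
    return type == 1 && m->ep4ta == desc_ep4ta;
}
```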
• A power-up or a reset invalidates all linear mappings, guest-physical mappings, and combined mappings.
— The VPID in the INVVPID descriptor is the one assigned to the virtual processor whose execution is being
emulated.
• Some instructions invalidate all entries in the TLBs and paging-structure caches—including for global translations. An example is the MOV to CR4 instruction if the value of bit 4 (page global enable—PGE) is
changing. Emulation of such an instruction may require execution of the INVVPID instruction as follows:
— The INVVPID type is single-context (1).
— The VPID in the INVVPID descriptor is the one assigned to the virtual processor whose execution is being
emulated.
If EPT is not in use, the logical processor associates all mappings it creates with the current VPID, and it will use
such mappings to translate linear addresses. For that reason, a VMM should not use the same VPID for different
non-EPT guests that use different page tables. Doing so may result in one guest using translations that pertain to
the other.
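A simple way to honor this rule is a monotonically increasing VPID assignment. The following C sketch (a hypothetical VMM policy, not mandated by the architecture) gives each non-EPT guest a distinct non-zero VPID, reserving VPID 0000H for VMX root operation:

```c
#include <stdint.h>

#define MAX_VPID 0xFFFF

static uint16_t next_vpid = 1;     /* VPID 0000H is reserved for the VMM itself */

/* Returns a fresh non-zero VPID, or 0 when the space is exhausted (at
 * which point the caller would need to recycle VPIDs using INVVPID). */
uint16_t alloc_vpid(void)
{
    if (next_vpid > MAX_VPID)
        return 0;
    return next_vpid++;
}
```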
If EPT is in use, the instructions enumerated above might not be configured to cause VM exits and the VMM might
not be emulating them. In that case, executions of the instructions by guest software properly invalidate the
required entries in the TLBs and paging-structure caches (see Section 28.3.3.1); execution of the INVVPID instruc-
tion is not required.
If EPT is in use, the logical processor associates all mappings it creates with the value of bits 51:12 of the current EPTP.
If a VMM uses different EPTP values for different guests, it may use the same VPID for those guests. Doing so
cannot result in one guest using translations that pertain to the other.
The following guidelines apply more generally and are appropriate even if EPT is in use:
• As detailed in Section 29.4.5, an access to the APIC-access page might not cause an APIC-access VM exit if
software does not properly invalidate information that may be cached from the paging structures. If, at one
time, the current VPID on a logical processor was a non-zero value X, it is recommended that software use the
INVVPID instruction with the “single-context” INVVPID type and with VPID X in the INVVPID descriptor before
a VM entry on the same logical processor that establishes VPID X and either (a) the “virtualize APIC accesses”
VM-execution control was changed from 0 to 1; or (b) the value of the APIC-access address was changed.
• Software can use the INVVPID instruction with the “all-context” INVVPID type immediately after execution of
the VMXON instruction or immediately prior to execution of the VMXOFF instruction. Either prevents potentially
undesired retention of information cached from paging structures between separate uses of VMX operation.
1. If the “mode-based execute control for EPT” VM-execution control is 1, software should use the INVEPT instruction after changing
privilege bit 10 from 1 to 0.
• Software should use the INVEPT instruction with the “single-context” INVEPT type before a VM entry with an
EPTP value X such that X[6] = 1 (accessed and dirty flags for EPT are enabled) if the logical processor had
earlier been in VMX non-root operation with an EPTP value Y such that Y[6] = 0 (accessed and dirty flags for
EPT are not enabled) and Y[51:12] = X[51:12].
• Software may use the INVEPT instruction after modifying a present EPT paging-structure entry (see Section
28.2.2) to change any of the privilege bits 2:0 from 0 to 1.1 Failure to do so may cause an EPT violation that
would not otherwise occur. Because an EPT violation invalidates any mappings that would be used by the access
that caused the EPT violation (see Section 28.3.3.1), an EPT violation will not recur if the original access is
performed again, even if the INVEPT instruction is not executed.
• Because a logical processor does not cache any information derived from EPT paging-structure entries that are
not present (see Section 28.2.2) or misconfigured (see Section 28.2.3.1), it is not necessary to execute INVEPT
following modification of an EPT paging-structure entry that had been not present or misconfigured.
• As detailed in Section 29.4.5, an access to the APIC-access page might not cause an APIC-access VM exit if
software does not properly invalidate information that may be cached from the EPT paging structures. If EPT
was in use on a logical processor at one time with EPTP X, it is recommended that software use the INVEPT
instruction with the “single-context” INVEPT type and with EPTP X in the INVEPT descriptor before a VM entry
on the same logical processor that enables EPT with EPTP X and either (a) the “virtualize APIC accesses” VM-
execution control was changed from 0 to 1; or (b) the value of the APIC-access address was changed.
• Software can use the INVEPT instruction with the “all-context” INVEPT type immediately after execution of the
VMXON instruction or immediately prior to execution of the VMXOFF instruction. Either prevents potentially
undesired retention of information cached from EPT paging structures between separate uses of VMX
operation.
In a system containing more than one logical processor, software must account for the fact that information from
an EPT paging-structure entry may be cached on logical processors other than the one that modifies that entry. The
process of propagating the changes to a paging-structure entry is commonly referred to as “TLB shootdown.” A
discussion of TLB shootdown appears in Section 4.10.5, “Propagation of Paging-Structure Changes to Multiple
Processors,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
1. If the “mode-based execute control for EPT” VM-execution control is 1, software may use the INVEPT instruction after modifying a
present EPT paging-structure entry to change privilege bit 10 from 0 to 1.
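The shootdown rendezvous can be sketched in C11. This is a hypothetical simulation, not the SDM's algorithm: pthreads stand in for the other logical processors, an atomic flag stands in for the shootdown IPI, and a comment marks where each remote CPU would execute INVEPT:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define REMOTE_CPUS 3

static atomic_bool flush_requested;
static atomic_int  acks;

static void *remote_cpu(void *arg)
{
    (void)arg;
    while (!atomic_load(&flush_requested))
        ;                                   /* wait for the shootdown "IPI" */
    /* ...a real CPU would execute INVEPT here... */
    atomic_fetch_add(&acks, 1);             /* acknowledge to the initiator */
    return NULL;
}

/* Initiator: modify the EPT paging-structure entry, broadcast the flush
 * request, and wait until every remote CPU has acknowledged before
 * reusing memory the old translation pointed to. Returns the ack count. */
int run_shootdown(void)
{
    pthread_t t[REMOTE_CPUS];
    atomic_store(&acks, 0);
    atomic_store(&flush_requested, false);
    for (int i = 0; i < REMOTE_CPUS; i++)
        pthread_create(&t[i], NULL, remote_cpu, NULL);
    /* ...edit the EPT paging-structure entry here... */
    atomic_store(&flush_requested, true);
    while (atomic_load(&acks) < REMOTE_CPUS)
        ;                                   /* stale mappings may be in use until here */
    for (int i = 0; i < REMOTE_CPUS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&acks);
}
```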
24.Updates to Chapter 34, Volume 3C
Change bars and green text show changes to Chapter 34 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3C: System Programming Guide, Part 3.
------------------------------------------------------------------------------------------
Changes to chapter: Added Control-flow Enforcement Technology System Management Mode details.
This chapter describes aspects of the Intel 64 and IA-32 architectures used in system management mode (SMM).
SMM provides an alternate operating environment that can be used to monitor and manage various system
resources for more efficient energy usage, to control system hardware, and/or to run proprietary code. It was
introduced into the IA-32 architecture in the Intel386 SL processor (a mobile specialized version of the Intel386
processor). It is also available in the Pentium M, Pentium 4, Intel Xeon, P6 family, and Pentium and Intel486
processors (beginning with the enhanced versions of the Intel486 SL and Intel486 processors).
NOTES
Software developers should be aware that, even if a logical processor was using the physical-
address extension (PAE) mechanism (introduced in the P6 family processors) or was in IA-32e
mode before an SMI, this will not be the case after the SMI is delivered. This is because delivery of
an SMI disables paging (see Table 34-4). (This does not apply if the dual-monitor treatment of SMIs
and SMM is active; see Section 34.15.)
SYSTEM MANAGEMENT MODE
NOTES
In the Pentium 4, Intel Xeon, and P6 family processors, when a processor that is designated as an
application processor during an MP initialization sequence is waiting for a startup IPI (SIPI), it is in
a mode where SMIs are masked. However, if an SMI is received while an application processor is in
the wait for SIPI mode, the SMI will be pended. The processor then responds on receipt of a SIPI by
immediately servicing the pended SMI and going into SMM before handling the SIPI.
An SMI may be blocked for one instruction following execution of STI, MOV to SS, or POP into SS.
1. The dual-monitor treatment may not be supported by all processors. Software should consult the VMX capability MSR
IA32_VMX_BASIC (see Appendix A.1) to determine whether it is supported.
Upon entering SMM, the processor signals external hardware that SMI handling has begun. The signaling mecha-
nism used is implementation dependent. For the P6 family processors, an SMI acknowledge transaction is gener-
ated on the system bus and the multiplexed status signal EXF4 is asserted each time a bus transaction is generated
while the processor is in SMM. For the Pentium and Intel486 processors, the SMIACT# pin is asserted.
An SMI has a greater priority than debug exceptions and external interrupts. Thus, if an NMI, maskable hardware
interrupt, or a debug exception occurs at an instruction boundary along with an SMI, only the SMI is handled.
Subsequent SMI requests are not acknowledged while the processor is in SMM. The first SMI interrupt request that
occurs while the processor is in SMM (that is, after SMM has been acknowledged to external hardware) is latched
and serviced when the processor exits SMM with the RSM instruction. The processor will latch only one SMI while
in SMM.
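The latching behavior just described amounts to a small state machine. The following C sketch (a toy model with hypothetical names, not processor logic) captures it: the first SMI enters SMM, at most one further SMI is latched while in SMM, and the latched SMI is serviced when RSM executes:

```c
#include <stdbool.h>

struct smi_state {
    bool in_smm;     /* processor is currently in SMM */
    bool pending;    /* one SMI latched while in SMM */
};

void deliver_smi(struct smi_state *s)
{
    if (!s->in_smm)
        s->in_smm = true;     /* SMI taken immediately */
    else
        s->pending = true;    /* only one SMI is latched; extras are dropped */
}

void rsm(struct smi_state *s)
{
    if (s->pending)
        s->pending = false;   /* latched SMI serviced: processor re-enters SMM */
    else
        s->in_smm = false;    /* normal exit from SMM */
}
```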
See Section 34.5 for a detailed description of the execution environment when in SMM.
NOTE
On processors that support the shadow-stack feature, RSM loads the SSP register from the state
save image in SMRAM (see Table 34-3). The value is made canonical by sign-extension before
loading it into SSP.
Upon successful completion of the RSM instruction, the processor signals external hardware that SMM has been
exited. For the P6 family processors, an SMI acknowledge transaction is generated on the system bus and the
multiplexed status signal EXF4 is no longer generated on bus cycles. For the Pentium and Intel486 processors, the
SMIACT# pin is deasserted.
If the processor detects invalid state information saved in the SMRAM, it enters the shutdown state and generates
a special bus cycle to indicate it has entered shutdown state. Shutdown happens only in the following situations:
• A reserved bit in control register CR4 is set to 1 on a write to CR4. This error should not happen unless SMI
handler code modifies reserved areas of the SMRAM saved state map (see Section 34.4.1). CR4 is saved in the
state map in a reserved location and cannot be read or modified in its saved state.
• An illegal combination of bits is written to control register CR0, in particular PG set to 1 and PE set to 0, or NW
set to 1 and CD set to 0.
• CR4.PCIDE would be set to 1 and IA32_EFER.LMA to 0.
• (For the Pentium and Intel486 processors only.) If the address stored in the SMBASE register when an RSM
instruction is executed is not aligned on a 32-KByte boundary. This restriction does not apply to the P6 family
processors.
• CR4.CET would be set to 1 and CR0.WP to 0.
In the shutdown state, Intel processors stop executing instructions until a RESET#, INIT# or NMI# is asserted.
While Pentium family processors recognize the SMI# signal in shutdown state, P6 family and Intel486 processors
do not. Intel does not support using SMI# to recover from shutdown states for any processor family; the response
of processors in this circumstance is not well defined. On Pentium 4 and later processors, shutdown will inhibit INTR
and A20M but will not change any of the other inhibits. On these processors, NMIs will be inhibited if no action is
taken in the SMI handler to uninhibit them (see Section 34.8).
If the processor is in the HALT state when the SMI is received, the processor handles the return from SMM slightly
differently (see Section 34.10). Also, the SMBASE address can be changed on a return from SMM (see Section
34.11).
34.4 SMRAM
Upon entering SMM, the processor switches to a new address space. Because paging is disabled upon entering
SMM, this initial address space maps all memory accesses to the low 4 GBytes of the processor's physical address
space. The SMI handler's critical code and data reside in a memory region referred to as system-management RAM
(SMRAM). The processor uses a pre-defined region within SMRAM to save the processor's pre-SMI context. SMRAM
can also be used to store system management information (such as the system configuration and specific informa-
tion about powered-down devices) and OEM-specific information.
The default SMRAM size is 64 KBytes beginning at a base physical address in physical memory called the SMBASE
(see Figure 34-1). The SMBASE default value following a hardware reset is 30000H. The processor looks for the
first instruction of the SMI handler at the address [SMBASE + 8000H]. It stores the processor’s state in the area
from [SMBASE + FE00H] to [SMBASE + FFFFH]. See Section 34.4.1 for a description of the mapping of the state
save area.
The system logic is minimally required to decode the physical address range for the SMRAM from [SMBASE +
8000H] to [SMBASE + FFFFH]. A larger area can be decoded if needed. The size of this SMRAM can be between 32
KBytes and 4 GBytes.
The location of the SMRAM can be changed by changing the SMBASE value (see Section 34.11). It should be noted
that all processors in a multiple-processor system are initialized with the same SMBASE value (30000H). Initializa-
tion software must sequentially place each processor in SMM and change its SMBASE so that it does not overlap
those of other processors.
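One workable relocation policy is to space the processors' SMRAM windows a full 64 KBytes apart. The base values below are illustrative only (the architecture does not mandate them); the helper checks that two 64-KByte windows, and therefore the state save areas inside them, cannot overlap:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical layout: CPU n gets SMBASE = 30000H + n * 10000H, so each
 * 64-KByte SMRAM window ([SMBASE]..[SMBASE + FFFFH]) is disjoint. These
 * multiples of 10000H also satisfy the 32-KByte alignment required by
 * Pentium and Intel486 processors. */
uint32_t smbase_for_cpu(unsigned cpu)
{
    return 0x30000u + (uint32_t)cpu * 0x10000u;
}

/* True if the 64-KByte SMRAM windows based at a and b overlap. */
bool smram_overlaps(uint32_t a, uint32_t b)
{
    return a < b + 0x10000u && b < a + 0x10000u;
}
```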
The actual physical location of the SMRAM can be in system memory or in a separate RAM memory. The processor
generates an SMI acknowledge transaction (P6 family processors) or asserts the SMIACT# pin (Pentium and
Intel486 processors) when the processor receives an SMI (see Section 34.3.1).
System logic can use the SMI acknowledge transaction or the assertion of the SMIACT# pin to decode accesses to
the SMRAM and redirect them (if desired) to specific SMRAM memory. If a separate RAM memory is used for
SMRAM, system logic should provide a programmable method of mapping the SMRAM into system memory space
when the processor is not in SMM. This mechanism will enable start-up procedures to initialize the SMRAM space
(that is, load the SMI handler) before executing the SMI handler during SMM.
[Figure 34-1. SMRAM Usage: SMRAM spans SMBASE through SMBASE + FFFFH; the state save area occupies the top of the region.]
The following registers are saved (but not readable) and restored upon exiting SMM:
• Control register CR4. (This register is cleared to all 0s when entering SMM).
• The hidden segment descriptor information stored in segment registers CS, DS, ES, FS, GS, and SS.
If an SMI request is issued for the purpose of powering down the processor, the values of all reserved locations in
the SMM state save must be saved to nonvolatile memory.
The following state is not automatically saved and restored following an SMI and the RSM instruction, respectively:
NOTES
A small subset of the MSRs (such as the time-stamp counter and performance-monitoring
counters) are not arbitrarily writable and therefore cannot be saved and restored. SMM-based
power-down and restoration should only be performed with operating systems that do not use or
rely on the values of these registers.
Operating system developers should be aware of this fact and ensure that their operating-system-assisted power-down and restoration software is immune to unexpected changes in these register
values.
Table 34-2. Processor Signatures and 64-bit SMRAM State Save Map Format
DisplayFamily_DisplayModel   Processor Families/Processor Number Series
06_17H                       Intel Xeon Processor 5200, 5400 series; Intel Core 2 Quad processor Q9xxx;
                             Intel Core 2 Duo processors E8000, T9000
06_0FH                       Intel Xeon Processor 3000, 3200, 5100, 5300, 7300 series; Intel Core 2 Quad,
                             Intel Core 2 Extreme, Intel Core 2 Duo processors; Intel Pentium dual-core processors
06_1CH                       45 nm Intel® Atom™ processors
Table 34-3. SMRAM State Save Map for Intel 64 Architecture (Contd.)
Offset Register Writable?
(Added to SMBASE + 8000H)
7EF7H - 7EE4H Reserved No
7EE0H Setting of “enable EPT” VM-execution control No
7ED8H Value of EPTP VM-execution control field No
7ED7H - 7ED0H Reserved No
7EC8H SSP Yes
7EC7H - 7EA0H Reserved No
7E9CH LDT Base (lower 32 bits) No
7E98H Reserved No
7E94H IDT Base (lower 32 bits) No
7E90H Reserved No
7E8CH GDT Base (lower 32 bits) No
7E8BH - 7E44H Reserved No
7E40H CR4 No
7E3FH - 7DF0H Reserved No
7DE8H IO_RIP Yes
7DE7H - 7DDCH Reserved No
7DD8H IDT Base (Upper 32 bits) No
7DD4H LDT Base (Upper 32 bits) No
7DD0H GDT Base (Upper 32 bits) No
7DCFH - 7C00H Reserved No
NOTE:
1. The two most significant bytes are reserved.
located in a dedicated region of system memory (typically in high memory). This SMRAM section can be cached for
optimum processor performance.
For systems that explicitly flush the caches upon entering SMM (the third method described above), the cache flush
can be accomplished by asserting the FLUSH# pin at the same time as the request to enter SMM (generally initi-
ated by asserting the SMI# pin). The priorities of the FLUSH# and SMI# pins are such that the FLUSH# is serviced
first. To guarantee this behavior, the processor requires that the following constraints on the interaction of FLUSH#
and SMI# be met. In a system where the FLUSH# and SMI# pins are synchronous and the set up and hold times
are met, then the FLUSH# and SMI# pins may be asserted in the same clock. In asynchronous systems, the
FLUSH# pin must be asserted at least one clock before the SMI# pin to guarantee that the FLUSH# pin is serviced
first.
Upon leaving SMM (for systems that explicitly flush the caches), the WBINVD instruction should be executed prior
to leaving SMM to flush the caches.
NOTES
In systems based on the Pentium processor that use the FLUSH# pin to write back and invalidate
cache contents before entering SMM, the processor will prefetch at least one cache line in between
when the Flush Acknowledge cycle is run and the subsequent recognition of SMI# and the assertion
of SMIACT#.
It is the obligation of the system to ensure that these lines are not cached by returning KEN#
inactive to the Pentium processor.
• Near jumps and calls can be made to anywhere in the 4-GByte address space if a 32-bit operand-size override
prefix is used. Due to the real-address-mode style of base-address formation, a far call or jump cannot transfer
control to a segment with a base address of more than 20 bits (1 MByte). However, since the segment limit in
SMM is 4 GBytes, offsets into a segment that go beyond the 1-MByte limit are allowed when using 32-bit
operand-size override prefixes. Any program control transfer that does not have a 32-bit operand-size override
prefix truncates the EIP value to the 16 low-order bits.
• Data and the stack can be located anywhere in the 4-GByte address space, but can be accessed only with a 32-
bit address-size override if they are located above 1 MByte. As with the code segment, the base address for a
data or stack segment cannot be more than 20 bits.
The value in segment register CS is automatically set to the default SMBASE (30000H) shifted 4 bits to the
right; that is, 3000H. The EIP register is set to 8000H. When the EIP value is added to the shifted CS value (the
SMBASE), the resulting linear address points to the first instruction of the SMI handler.
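The arithmetic above can be checked directly. This C sketch (hypothetical helper; it models only the default-SMBASE case, where SMBASE >> 4 fits in a 16-bit selector) computes the linear address of the first handler instruction:

```c
#include <stdint.h>

/* CS is loaded with SMBASE >> 4 and EIP with 8000H, so the first handler
 * instruction lies at (CS << 4) + EIP = SMBASE + 8000H for the default
 * SMBASE of 30000H. */
uint32_t smm_entry_linear(uint32_t smbase)
{
    uint16_t cs  = (uint16_t)(smbase >> 4);   /* e.g. 30000H -> 3000H */
    uint32_t eip = 0x8000;
    return ((uint32_t)cs << 4) + eip;
}
```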
The other segment registers (DS, SS, ES, FS, and GS) are cleared to 0 and their segment limits are set to 4 GBytes.
In this state, the SMRAM address space may be treated as a single flat 4-GByte linear address space. If a segment
register is loaded with a 16-bit value, that value is then shifted left by 4 bits and loaded into the segment base
(hidden part of the segment register). The limits and attributes are not modified.
Maskable hardware interrupts, exceptions, NMI interrupts, SMI interrupts, A20M interrupts, single-step traps,
breakpoint traps, and INIT operations are inhibited when the processor enters SMM. Maskable hardware interrupts,
exceptions, single-step traps, and breakpoint traps can be enabled in SMM if the SMM execution environment
provides and initializes an interrupt table and the necessary interrupt and exception handlers (see Section 34.6).
Table 34-5. I/O Instruction Information in the SMM State Save Map
(State valid for SMM Rev. ID 30004H or higher)

I/O state field (SMRAM offset 7FA4H):
  Bit 0        IO_SMI
  Bits 3:1     I/O length
  Bits 7:4     I/O type
  Bits 15:8    Reserved
  Bits 31:16   I/O port

I/O memory address field (SMRAM offset 7FA0H):
  Bits 31:0    I/O memory address
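A handler that inspects the I/O state field can decode it as follows. This C sketch uses hypothetical struct and function names; the bit positions follow the layout in Table 34-5:

```c
#include <stdint.h>
#include <stdbool.h>

struct io_state {
    bool     io_smi;   /* SMI taken on an I/O instruction boundary */
    unsigned length;   /* encoded I/O access length */
    unsigned type;     /* encoded I/O instruction type (IN, OUT, INS, ...) */
    unsigned port;     /* I/O port address */
};

/* Decode the 32-bit I/O state field at SMRAM offset 7FA4H. */
struct io_state decode_io_state(uint32_t field)
{
    struct io_state s;
    s.io_smi = field & 1;
    s.length = (field >> 1) & 0x7;
    s.type   = (field >> 4) & 0xF;
    s.port   = (field >> 16) & 0xFFFF;
    return s;
}
```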
SMM revision identifier (SMRAM offset 7EFCH):
  Bits 15:0    SMM revision identifier
  Bit 16       I/O instruction restart
  Bit 17       SMBASE relocation
  Bits 31:18   Reserved
The upper word of the SMM revision identifier refers to the extensions available. If the I/O instruction restart flag
(bit 16) is set, the processor supports the I/O instruction restart (see Section 34.12); if the SMBASE relocation flag
(bit 17) is set, SMRAM base address relocation is supported (see Section 34.11).
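These capability bits can be tested with simple shifts. The helper names below are hypothetical; the bit positions are those given above for the SMM revision identifier:

```c
#include <stdint.h>
#include <stdbool.h>

/* Bits 15:0 hold the revision; bit 16 advertises I/O instruction restart
 * support and bit 17 advertises SMBASE relocation support. */
unsigned smm_revision(uint32_t rev)          { return rev & 0xFFFF; }
bool smm_supports_io_restart(uint32_t rev)   { return (rev >> 16) & 1; }
bool smm_supports_smbase_reloc(uint32_t rev) { return (rev >> 17) & 1; }
```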
Auto HALT restart field (SMRAM offset 7F02H):
  Bit 0        Auto HALT restart
  Bits 15:1    Reserved
These options are summarized in Table 34-7. If the processor was not in a HALT state when the SMI was received
(the auto HALT restart flag is cleared), setting the flag to 1 will cause unpredictable behavior when the RSM instruc-
tion is executed.
If the HLT instruction is restarted, the processor will generate a memory access to fetch the HLT instruction (if it is
not in the internal cache), and execute a HLT bus transaction. This behavior results in multiple HLT bus transactions
for the same HLT instruction.
SMBASE field (SMRAM offset 7EF8H):
  Bits 31:0    SMM base
In multiple-processor systems, initialization software must adjust the SMBASE value for each processor so that the
SMRAM state save areas for each processor do not overlap. (For Pentium and Intel486 processors, the SMBASE
values must be aligned on a 32-KByte boundary or the processor will enter shutdown state during the execution of
a RSM instruction.)
If the SMBASE relocation flag in the SMM revision identifier field is set, it indicates the ability to relocate the
SMBASE (see Section 34.9).
I/O instruction restart field (SMRAM offset 7F00H):
  Bits 15:0    I/O instruction restart field
If the I/O instruction restart field contains the value 00H when the RSM instruction is executed, then the processor
begins program execution with the instruction following the I/O instruction. (When a repeat prefix is being used,
the next instruction may be the next I/O instruction in the repeat loop.) Not re-executing the interrupted I/O
instruction is the default behavior; the processor automatically initializes the I/O instruction restart field to 00H
upon entering SMM. Table 34-8 summarizes the states of the I/O instruction restart field.
The I/O instruction restart mechanism does not indicate the cause of the SMI. It is the responsibility of the SMI
handler to examine the state of the processor to determine the cause of the SMI and to determine if an I/O instruc-
tion was interrupted and should be restarted upon exiting SMM. If an SMI interrupt is signaled on a non-I/O instruc-
tion boundary, setting the I/O instruction restart field to FFH prior to executing the RSM instruction will likely result
in a program error.
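The effect of the restart field on RSM can be modeled as a one-line decision. This C sketch uses hypothetical names and treats only the two architecturally defined values, 00H (resume at the next instruction) and FFH (re-execute the interrupted I/O instruction):

```c
#include <stdint.h>

/* next_rip is the saved instruction pointer (the instruction after the
 * interrupted I/O instruction); io_rip points at the I/O instruction
 * itself. RSM with FFH restarts the I/O instruction. */
uint64_t rip_after_rsm(uint8_t restart_field, uint64_t io_rip, uint64_t next_rip)
{
    return (restart_field == 0xFF) ? io_rip : next_rip;
}
```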
34.12.1 Back-to-Back SMI Interrupts When I/O Instruction Restart Is Being Used
If an SMI interrupt is signaled while the processor is servicing an SMI interrupt that occurred on an I/O instruction
boundary, the processor will service the new SMI request before restarting the originally interrupted I/O instruc-
tion. If the I/O instruction restart field is set to FFH prior to returning from the second SMI handler, the EIP will point
to an address different from the originally interrupted I/O instruction, which will likely lead to a program error. To
avoid this situation, the SMI handler must be able to recognize the occurrence of back-to-back SMI interrupts when
I/O instruction restart is being used and ensure that the handler sets the I/O instruction restart field to 00H prior to
returning from the second invocation of the SMI handler.
34.14 DEFAULT TREATMENT OF SMIS AND SMM WITH VMX OPERATION AND
SMX OPERATION
Under the default treatment, the interactions of SMIs and SMM with VMX operation are few. This section details
those interactions. It also explains how this treatment affects SMX operation.
enter SMM;
save the following internal to the processor:
    CR4.VMXE
    an indication of whether the logical processor was in VMX operation (root or non-root)
IF the logical processor is in VMX operation
    THEN
        save current VMCS pointer internal to the processor;
        leave VMX operation;
        save VMX-critical state defined below;
FI;
IF the logical processor supports SMX operation
    THEN
        save internal to the logical processor an indication of whether the Intel® TXT private space is locked;
        IF the TXT private space is unlocked
            THEN lock the TXT private space;
        FI;
FI;
CR4.VMXE ← 0;
perform ordinary SMI delivery:
    save processor state in SMRAM;
    set processor state to standard SMM values;1
invalidate linear mappings and combined mappings associated with VPID 0000H (for all PCIDs); combined mappings for VPID 0000H
are invalidated for all EP4TA values (EP4TA is the value of bits 51:12 of EPTP; see Section 28.3);
The pseudocode above makes reference to the saving of VMX-critical state. This state consists of the following:
(1) SS.DPL (the current privilege level); (2) RFLAGS.VM2; (3) the state of blocking by STI and by MOV SS (see
Table 24-3 in Section 24.4.2); (4) the state of virtual-NMI blocking (only if the processor is in VMX non-root oper-
ation and the “virtual NMIs” VM-execution control is 1); and (5) an indication of whether an MTF VM exit is pending
(see Section 25.5.2). These data may be saved internal to the processor or in the VMCS region of the current
VMCS. Processors that do not support SMI recognition while there is blocking by STI or by MOV SS need not save
the state of such blocking.
If the logical processor supports the 1-setting of the “enable EPT” VM-execution control and the logical processor
was in VMX non-root operation at the time of an SMI, it saves the value of that control into bit 0 of the 32-bit field
at offset SMBASE + 8000H + 7EE0H (SMBASE + FEE0H; see Table 34-3).3 If the logical processor was not in VMX
non-root operation at the time of the SMI, it saves 0 into that bit. If the logical processor saves 1 into that bit (it
was in VMX non-root operation and the “enable EPT” VM-execution control was 1), it saves the value of the EPT
pointer (EPTP) into the 64-bit field at offset SMBASE + 8000H + 7ED8H (SMBASE + FED8H).
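The save-area addresses named above can be sketched as simple offset arithmetic; the helper names are hypothetical, and the offsets are exactly those given in the text:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers computing the SMRAM save-state addresses named
 * in the text. The SMM state save area begins at SMBASE + 8000H; the
 * field offsets 7EE0H (32-bit "enable EPT" indication) and 7ED8H
 * (64-bit EPTP value) are relative to that base. */
static uint64_t smram_ept_enable_field(uint64_t smbase) {
    return smbase + 0x8000u + 0x7EE0u;   /* = SMBASE + 0xFEE0 */
}
static uint64_t smram_eptp_field(uint64_t smbase) {
    return smbase + 0x8000u + 0x7ED8u;   /* = SMBASE + 0xFED8 */
}
```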
Because SMI delivery causes a logical processor to leave VMX operation, all the controls associated with VMX non-
root operation are disabled in SMM and thus cannot cause VM exits while the logical processor is in SMM.
1. This causes the logical processor to block INIT signals, NMIs, and SMIs.
2. Section 34.14 and Section 34.15 use the notation RAX, RIP, RSP, RFLAGS, etc. for processor registers because most processors that
support VMX operation also support Intel 64 architecture. For processors that do not support Intel 64 architecture, this notation
refers to the 32-bit forms of these registers (EAX, EIP, ESP, EFLAGS, etc.). In a few places, notation such as EAX is used to refer spe-
cifically to the lower 32 bits of the register.
3. “Enable EPT” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-execution controls
is 0, SMI delivery functions as if the “enable EPT” VM-execution control were 0. See Section 24.6.2.
1. If the RSM is to VMX non-root operation and both the “unrestricted guest” VM-execution control and bit 31 of the primary proces-
sor-based VM-execution controls will be 1, CR0.PE and CR0.PG retain the values that were loaded from SMRAM regardless of what is
reported in the capability MSR IA32_VMX_CR0_FIXED0.
2. “Unrestricted guest” is a secondary processor-based VM-execution control. If bit 31 of the primary processor-based VM-execution
controls is 0, VM entry functions as if the “unrestricted guest” VM-execution control were 0. See Section 24.6.2.
• System-management interrupts (SMIs), INIT signals, and higher priority events take priority over these MTF
VM exits. These MTF VM exits take priority over debug-trap exceptions and lower priority events.
• These MTF VM exits wake the logical processor if RSM caused the logical processor to enter the HLT state (see
Section 34.10). They do not occur if the logical processor just entered the shutdown state.
1. Setting IA32_SMM_MONITOR_CTL[bit 2] to 1 prevents VMXOFF from unblocking SMIs regardless of the value of the register’s valid
bit (bit 0).
Table 34-9. Exit Qualification for SMIs That Arrive Immediately After the Retirement of an I/O Instruction
prefix) at the time the instruction started. If the relevant segment is not usable, the value is undefined. On
processors that support Intel 64 architecture, bits 63:32 are clear if the logical processor was not in 64-bit
mode before the VM exit.
• I/O RCX, I/O RSI, I/O RDI, and I/O RIP. For an SMM VM exit due to an SMI that arrives immediately after
the retirement of an I/O instruction, these fields receive the values that were in RCX, RSI, RDI, and RIP, respec-
tively, before the I/O instruction executed. Thus, the value saved for I/O RIP addresses the I/O instruction.
1. In this situation, the value of this MSR was saved earlier into the guest-state area. All VM exits save this MSR if the 1-setting of the
“load IA32_RTIT_CTL” VM-entry control is supported (see Section 27.3.1), which must be the case if the “Intel PT uses guest physi-
cal addresses” VM-execution control is 1 (see Section 26.2.1.1).
SMM VM exits invalidate linear mappings and combined mappings associated with VPID 0000H for all PCIDs.
Combined mappings for VPID 0000H are invalidated for all EP4TA values (EP4TA is the value of bits 51:12 of EPTP;
see Section 28.3). (Ordinary VM exits are not required to perform such invalidation if the “enable VPID” VM-execu-
tion control is 1; see Section 27.5.5.)
1. Software can determine a processor’s physical-address width by executing CPUID with 80000008H in EAX. The physical-address
width is returned in bits 7:0 of EAX.
2. If IA32_VMX_BASIC[48] is read as 1, this pointer must not set any bits in the range 63:32; see Appendix A.1.
3. The STM can determine the VMXON pointer by reading the executive-VMCS pointer field in the current VMCS after the SMM VM exit
that activates the dual-monitor treatment.
• If the executive-VMCS pointer field contains the VMXON pointer (the VM entry remains in VMX root operation),
the checks are not performed at all.
• If the executive-VMCS pointer field does not contain the VMXON pointer (the VM entry enters VMX non-root
operation), the checks are performed on the VM-execution control fields in the executive VMCS (the VMCS
referenced by the executive-VMCS pointer field in the current VMCS). These checks are performed after
checking the executive-VMCS pointer field itself (for proper alignment).
Other VM entries ensure that, if the “activate VMX-preemption timer” VM-execution control is 0, the “save VMX-
preemption timer value” VM-exit control is also 0. This check is not performed by VM entries that return from SMM.
1. A logical processor is in SMX operation if GETSEC[SEXIT] has not been executed since the last execution of GETSEC[SENTER]. A logi-
cal processor is outside SMX operation if GETSEC[SENTER] has not been executed or if GETSEC[SEXIT] was executed after the last
execution of GETSEC[SENTER]. See Chapter 6, “Safer Mode Extensions Reference‚” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2D.
The MSEG header fields referenced below, by byte offset:
• Offset 8: GDTR limit
• Offset 16: CS selector
• Offset 20: EIP offset
• Offset 24: ESP offset
• Offset 28: CR3 offset
1. Software should consult the VMX capability MSR IA32_VMX_BASIC (see Appendix A.1) to determine whether the dual-monitor
treatment is supported.
To ensure proper behavior in VMX operation, software should maintain the MSEG header in writeback cacheable
memory. Future implementations may allow or require a different memory type.1 Software should consult the VMX
capability MSR IA32_VMX_BASIC (see Appendix A.1).
SMM code should enable the dual-monitor treatment (by setting the valid bit in IA32_SMM_MONITOR_CTL MSR)
only after establishing the content of the MSEG header as follows:
• Bytes 3:0 contain the MSEG revision identifier. Different processors may use different MSEG revision identi-
fiers. These identifiers enable software to avoid using an MSEG header formatted for one processor on a
processor that uses a different format. Software can discover the MSEG revision identifier that a processor uses
by reading the VMX capability MSR IA32_VMX_MISC (see Appendix A.6).
• Bytes 7:4 contain the SMM-transfer monitor features field. Bits 31:1 of this field are reserved and must be
zero. Bit 0 of the field is the IA-32e mode SMM feature bit. It indicates whether the logical processor will be
in IA-32e mode after the STM is activated (see Section 34.15.6).
• Bytes 31:8 contain fields that determine how processor state is loaded when the STM is activated (see Section
34.15.6.5). SMM code should establish these fields so that activation of the STM invokes the STM’s initialization
code.
1. Alternatively, software may map the MSEG header with the UC memory type; this may be necessary, depending on how memory is
organized. Doing so is strongly discouraged unless necessary as it will cause the performance of transitions using those structures
to suffer significantly. In addition, the processor will continue to use the memory type reported in the VMX capability MSR
IA32_VMX_BASIC with exceptions noted in Appendix A.1.
2. Software should consult the VMX capability MSR IA32_VMX_BASIC (see Appendix A.1) to determine whether the dual-monitor
treatment is supported.
— Bits 4:3 are set to bits 4:3 of the CR3-offset field in the MSEG header.
• CR4 is set as follows:
— MCE, PGE, CET, and PCIDE are cleared.
— PAE is set to the value of the IA-32e mode SMM feature bit.
— If the IA-32e mode SMM feature bit is clear, PSE is set to 1 if supported by the processor; if the bit is set,
PSE is cleared.
— All other bits are unchanged.
• DR7 is set to 400H.
• The IA32_DEBUGCTL MSR is cleared to 00000000_00000000H.
• The registers CS, SS, DS, ES, FS, and GS are loaded as follows:
— All registers are usable.
— CS.selector is loaded from the corresponding field in the MSEG header (the high 16 bits are ignored), with
bits 2:0 cleared to 0. If the result is 0000H, CS.selector is set to 0008H.
— The selectors for SS, DS, ES, FS, and GS are set to CS.selector+0008H. If the result is 0000H (if the CS
selector was FFF8H), these selectors are instead set to 0008H.
— The base addresses of all registers are cleared to zero.
— The segment limits for all registers are set to FFFFFFFFH.
— The AR bytes for the registers are set as follows:
• CS.Type is set to 11 (execute/read, accessed, non-conforming code segment).
• For SS, DS, ES, FS, and GS, the Type is set to 3 (read/write, accessed, expand-up data segment).
• The S bits for all registers are set to 1.
• The DPL for each register is set to 0.
• The P bits for all registers are set to 1.
• On processors that support Intel 64 architecture, CS.L is loaded with the value of the IA-32e mode SMM
feature bit.
• CS.D is loaded with the inverse of the value of the IA-32e mode SMM feature bit.
• For each of SS, DS, ES, FS, and GS, the D/B bit is set to 1.
• The G bits for all registers are set to 1.
• LDTR is unusable. The LDTR selector is cleared to 0000H, and the register is otherwise undefined (although the
base address is always canonical).
• GDTR.base is set to the sum of the MSEG base address and the GDTR base-offset field in the MSEG header
(bits 63:32 are always cleared on processors that support IA-32e mode). GDTR.limit is set to the corresponding
field in the MSEG header (the high 16 bits are ignored).
• IDTR.base is unchanged. IDTR.limit is cleared to 0000H.
• RIP is set to the sum of the MSEG base address and the value of the RIP-offset field in the MSEG header
(bits 63:32 are always cleared on logical processors that support IA-32e mode).
• RSP is set to the sum of the MSEG base address and the value of the RSP-offset field in the MSEG header
(bits 63:32 are always cleared on logical processors that support IA-32e mode).
• RFLAGS is cleared, except bit 1, which is always set.
• The logical processor is left in the active state.
• Event blocking after the SMM VM exit is as follows:
— There is no blocking by STI or by MOV SS.
— There is blocking by non-maskable interrupts (NMIs) and by SMIs.
• There are no pending debug exceptions after the SMM VM exit.
• For processors that support IA-32e mode, the IA32_EFER MSR is modified so that LME and LMA both contain
the value of the IA-32e mode SMM feature bit.
If any of CR3[63:5], CR4.PAE, CR4.PSE, or IA32_EFER.LMA is changing, the TLBs are updated so that, after
VM exit, the logical processor does not use translations that were cached before the transition. This is not neces-
sary for changes that would not affect paging due to the settings of other bits (for example, changes to CR4.PSE if
IA32_EFER.LMA was 1 before and after the transition).
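The selector derivation described above (CS.selector from the MSEG header with bits 2:0 cleared and 0000H replaced by 0008H; data-segment selectors at CS.selector + 0008H with the FFF8H wrap case also mapped to 0008H) can be sketched as follows; the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the STM segment-selector derivation described above.
 * CS.selector comes from the low 16 bits of the MSEG CS-selector field
 * with bits 2:0 cleared; a result of 0000H becomes 0008H. */
static uint16_t stm_cs_selector(uint32_t mseg_cs_field) {
    uint16_t sel = (uint16_t)mseg_cs_field & 0xFFF8u; /* clear bits 2:0 */
    return sel == 0x0000u ? 0x0008u : sel;
}

/* SS/DS/ES/FS/GS selectors are CS.selector + 0008H; if that wraps to
 * 0000H (CS.selector was FFF8H), 0008H is used instead. */
static uint16_t stm_data_selector(uint16_t cs_sel) {
    uint16_t sel = (uint16_t)(cs_sel + 0x0008u);
    return sel == 0x0000u ? 0x0008u : sel;
}
```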
25. Updates to Chapter 35, Volume 3C
Change bars show changes to Chapter 35 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3C: System Programming Guide, Part 3.
------------------------------------------------------------------------------------------
Changes to chapter: Minor update to Section 35.6 “Tracing and SMM Transfer Monitor (STM)”.
CHAPTER 35
INTEL® PROCESSOR TRACE
35.1 OVERVIEW
Intel® Processor Trace (Intel PT) is an extension of Intel® Architecture that captures information about software
execution using dedicated hardware facilities that cause only minimal performance perturbation to the software
being traced. This information is collected in data packets. The initial implementations of Intel PT offer control
flow tracing, which generates a variety of packets to be processed by a software decoder. The packets include
timing, program flow information (e.g. branch targets, branch taken/not taken indications) and program-induced
mode related information (e.g. Intel TSX state transitions, CR3 changes). These packets may be buffered internally
before being sent to the memory subsystem or other output mechanism available in the platform. Debug software
can process the trace data and reconstruct the program flow.
Later generations include additional trace sources, including software trace instrumentation using PTWRITE, and
Power Event tracing.
— Overflow (OVF) packets: OVF packets are sent when the processor experiences an internal buffer overflow,
resulting in packets being dropped. This packet notifies the decoder of the loss and can help the decoder to
respond to this situation.
• Packets about control flow information:
— Taken Not-Taken (TNT) packets: TNT packets track the “direction” of direct conditional branches (taken or
not taken).
— Target IP (TIP) packets: TIP packets record the target IP of indirect branches, exceptions, interrupts, and
other branches or events. These packets can contain the IP, although that IP value may be compressed by
eliminating upper bytes that match the last IP. There are various types of TIP packets; they are covered in
more detail in Section 35.4.2.2.
— Flow Update Packets (FUP): FUPs provide the source IP addresses for asynchronous events (interrupt and
exceptions), as well as other cases where the source address cannot be determined from the binary.
— MODE packets: These packets provide the decoder with important processor execution information so that
it can properly interpret the dis-assembled binary and trace log. MODE packets have a variety of formats
that indicate details such as the execution mode (16-bit, 32-bit, or 64-bit).
• Packets inserted by software:
— PTWRITE (PTW) packets: Include the value of the operand passed to the PTWRITE instruction (see
“PTWRITE - Write Data to a Processor Trace Packet” in Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2B).
• Packets about processor power management events:
— MWAIT packets: Indicate successful completion of an MWAIT operation to a C-state deeper than C0.0.
— Power State Entry (PWRE) packets: Indicate entry to a C-state deeper than C0.0.
— Power State Exit (PWRX) packets: Indicate exit from a C-state deeper than C0.0, returning to C0.
— Execution Stopped (EXSTOP) packets: Indicate that software execution has stopped, due to events such as
P-state change, C-state change, or thermal throttling.
• Packets containing groups of processor state values:
— Block Begin Packets (BBP): Indicate the type of state held in the following group.
— Block Item Packets (BIP): Indicate the state values held in the group.
— Block End Packets (BEP): Indicate the end of the current group.
A special case is applied if the target of the RET is consistent with what would be expected from tracking the
CALL stack. If it is assured that the decoder has seen the corresponding CALL (with “corresponding” defined
as the CALL with matching stack depth), and the RET target is the instruction after that CALL, the RET
target may be “compressed”. In this case, only a single TNT bit of “taken” is generated instead of a Target
IP Packet. To ensure that the decoder will not be confused in cases of RET compression, only RETs that
correspond to CALLs which have been seen since the last PSB packet may be compressed in a given logical
processor. For details, see “Indirect Transfer Compression for Returns (RET)” in Section 35.4.2.2.
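A decoder-side sketch of RET compression follows, under the simplifying assumption of a flat, bounded call stack with no PSB-boundary handling; all names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Decoder-side sketch of RET compression. On a CALL the decoder pushes
 * the address of the instruction after the CALL; a compressed RET (a
 * single "taken" TNT bit) resolves to the most recently pushed return
 * address, matching the call depth as described in the text. */
#define STACK_MAX 64
typedef struct { uint64_t ret_addr[STACK_MAX]; int depth; } call_stack_t;

static void on_call(call_stack_t *s, uint64_t next_ip) {
    if (s->depth < STACK_MAX) s->ret_addr[s->depth++] = next_ip;
}

/* Returns the target of a compressed RET, or 0 if the stack is empty
 * (in which case a real TIP packet would have been emitted instead). */
static uint64_t on_compressed_ret(call_stack_t *s) {
    return s->depth > 0 ? s->ret_addr[--s->depth] : 0;
}
```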
35.2.4.3 Filtering by IP
Trace packet generation with configurable filtering by IP is supported if CPUID.(EAX=14H, ECX=0):EBX[bit 2] = 1.
Intel PT can be configured to enable the generation of packets containing architectural states only when the
processor is executing code within certain IP ranges. If the IP is outside of these ranges, generation of some
packets is blocked.
IP filtering is enabled using the ADDRn_CFG fields in the IA32_RTIT_CTL MSR (Section 35.2.7.2), where the digit
'n' is a zero-based number that selects which address range is being configured. Each ADDRn_CFG field configures
the use of the register pair IA32_RTIT_ADDRn_A and IA32_RTIT_ADDRn_B (Section 35.2.7.5).
IA32_RTIT_ADDRn_A defines the base and IA32_RTIT_ADDRn_B specifies the limit of the range in which tracing is
enabled. Thus each range, referred to as the ADDRn range, is defined by [IA32_RTIT_ADDRn_A,
IA32_RTIT_ADDRn_B]. There can be multiple such ranges; software can query CPUID (Section 35.3.1) for the
number of ranges supported on a processor.
Default behavior (ADDRn_CFG=0) defines no IP filter range, meaning FilterEn is always set. In this case code at
any IP can be traced, though other filters, such as CR3 or CPL, could limit tracing. When ADDRn_CFG is set to
enable IP filtering (see Section 35.3.1), tracing will commence when a taken branch or event is seen whose target
address is in the ADDRn range.
While FilterEn is set, leaving the tracing region may only be detected once a taken
branch or event with a target outside the range is retired. If an ADDRn range is entered or exited by executing the
next sequential instruction, rather than by a control flow transfer, FilterEn may not toggle immediately. See Section
35.2.5.5 for more details on FilterEn.
Note that the address-range base and limit values are inclusive: the range includes any instruction whose first
instruction byte lies between IA32_RTIT_ADDRn_A and IA32_RTIT_ADDRn_B, including those at the endpoints.
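The inclusive comparison can be written down directly; the function name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the inclusive ADDRn range comparison described above: an IP
 * is in the range when ADDRn_A <= IP <= ADDRn_B, with both endpoints
 * included. */
static int in_addrn_range(uint64_t ip, uint64_t addrn_a, uint64_t addrn_b) {
    return ip >= addrn_a && ip <= addrn_b;
}
```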
Depending upon processor implementation, IP filtering may be based on the linear or the effective address. This can
cause different behavior between implementations if the CS base is not zero or the processor is in real mode. See
Section 35.3.1.1 for details. Software can query CPUID to determine whether filters are based on linear or effective
address (Section 35.3.1).
Note that some packets, such as MTC (Section 35.3.7) and other timing packets, do not depend on FilterEn. For
details on which packets depend on FilterEn, and hence are impacted by IP filtering, see Section 35.4.1.
TraceStop
The ADDRn ranges can also be configured to cause tracing to be disabled upon entry to the specified region. This is
intended for cases where unexpected code is executed, and the user wishes to immediately stop generating
packets in order to avoid overwriting previously written packets.
The TraceStop mechanism works much the same way that IP filtering does, and uses the same address comparison
logic. The TraceStop region base and limit values are programmed into one or more ADDRn ranges, but
IA32_RTIT_CTL.ADDRn_CFG is configured with the TraceStop encoding. Like FilterEn, TraceStop is detected when
a taken branch or event lands in a TraceStop region.
Further, TraceStop requires that TriggerEn=1 at the beginning of the branch/event, and ContextEn=1 upon
completion of the branch/event. When this happens, the CPU will set IA32_RTIT_STATUS.Stopped, thereby
clearing TriggerEn and hence disabling packet generation. This may generate a TIP.PGD packet with the target IP
of the branch or event that entered the TraceStop region. Finally, a TraceStop packet will be inserted, to indicate
that the condition was hit.
If a TraceStop condition is encountered during buffer overflow (Section 35.3.8), it will not be dropped, but will
instead be signaled once the overflow has resolved.
Note that a TraceStop event does not guarantee that all internally buffered packets are flushed out of internal
buffers. To ensure that this has occurred, the user should clear TraceEn.
To resume tracing after a TraceStop event, the user must first disable Intel PT by clearing IA32_RTIT_CTL.TraceEn
before the IA32_RTIT_STATUS.Stopped bit can be cleared. At this point Intel PT can be reconfigured, and tracing
resumed.
Note that the IA32_RTIT_STATUS.Stopped bit can also be set using the ToPA STOP bit. See Section 35.2.6.2.
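The resume sequence described above can be modeled as follows. The MSR state is held in plain variables here (on real hardware these would be RDMSR/WRMSR accesses), and the Stopped bit position used below is an assumption; consult the IA32_RTIT_STATUS definition:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the TraceStop resume sequence: clear TraceEn first, then the
 * Stopped bit can be cleared, then tracing is reconfigured and
 * re-enabled. IA32_RTIT_CTL.TraceEn is bit 0; the Stopped bit position
 * (bit 5) is an assumption in this sketch. */
#define RTIT_CTL_TRACEEN    (1ull << 0)
#define RTIT_STATUS_STOPPED (1ull << 5)

typedef struct { uint64_t rtit_ctl, rtit_status; } pt_state_t;

static void pt_resume_after_tracestop(pt_state_t *pt) {
    pt->rtit_ctl &= ~RTIT_CTL_TRACEEN;       /* 1: disable Intel PT   */
    pt->rtit_status &= ~RTIT_STATUS_STOPPED; /* 2: clear Stopped      */
    /* 3: reconfigure other IA32_RTIT_* MSRs as needed (omitted)      */
    pt->rtit_ctl |= RTIT_CTL_TRACEEN;        /* 4: re-enable tracing  */
}
```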
IP Filtering Example
The following table gives an example of IP filtering behavior. Assume that IA32_RTIT_ADDRn_A = the IP of
RangeBase, and that IA32_RTIT_ADDRn_B = the IP of RangeLimit, while IA32_RTIT_CTL.ADDRn_CFG = 0x1 (enable
ADDRn range as a FilterEn range).
1. Trace packet generation is disabled in a production enclave; see Section 35.2.8.5. See the Intel® Software Guard
Extensions Programming Reference for differences between a production enclave and a debug enclave.
This output method is best suited for cases where Intel PT output is either:
• Configured to be directed to a sufficiently large contiguous region of DRAM.
• Configured to go to an MMIO debug port, in order to route Intel PT output to a platform-specific trace endpoint
(e.g., JTAG). In this scenario, a specific range of addresses is written in a circular manner, and the SoC will
intercept these writes and direct them to the proper device. Repeated writes to the same address do not
overwrite each other, but are accumulated by the debugger, and hence no data is lost by the circular nature of
the buffer.
The processor will determine the address to which to write the next trace packet output byte as follows:
OutputBase[63:0] ← IA32_RTIT_OUTPUT_BASE[63:0]
OutputMask[63:0] ← ZeroExtend64(IA32_RTIT_OUTPUT_MASK_PTRS[31:0])
OutputOffset[63:0] ← ZeroExtend64(IA32_RTIT_OUTPUT_MASK_PTRS[63:32])
trace_store_phys_addr ← (OutputBase & ~OutputMask) + (OutputOffset & OutputMask)
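The computation above can be expressed directly in C; the function name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the single-range store-address computation above. mask_ptrs
 * is the IA32_RTIT_OUTPUT_MASK_PTRS value: OutputMask in bits 31:0 and
 * OutputOffset in bits 63:32. */
static uint64_t sr_store_addr(uint64_t output_base, uint64_t mask_ptrs) {
    uint64_t mask   = mask_ptrs & 0xFFFFFFFFu;  /* bits 31:0  */
    uint64_t offset = mask_ptrs >> 32;          /* bits 63:32 */
    return (output_base & ~mask) + (offset & mask);
}
```

For example, with a 4-KByte region at physical address 100000H (mask FFFH) and a current offset of 10H, the next store lands at 100010H.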
• proc_trace_output_offset.
This is a pointer into the current output region and indicates the location of the next write. When tracing is
enabled, the processor loads this value from bits 63:32 (OutputOffset) of the
IA32_RTIT_OUTPUT_MASK_PTRS. While tracing is enabled, the processor updates
IA32_RTIT_OUTPUT_MASK_PTRS.OutputOffset with changes to proc_trace_output_offset, but these updates
may not be synchronous to software execution. When tracing is disabled, the processor ensures that the MSR
contains the latest value of proc_trace_output_offset.
Figure 35-1 provides an illustration (not to scale) of the table and associated pointers.
[Figure 35-1 shows two ToPA tables, A and B: proc_trace_table_base (from IA32_RTIT_OUTPUT_BASE) points to
the current table; proc_trace_table_offset (IA32_RTIT_OUTPUT_MASK_PTRS.MaskOrTableOffset << 3) selects the
current entry; entries point to output regions of varying sizes (e.g., 4K, 64K) in physical memory; an END=1 entry
in Table A points to Table B; a STOP=1 entry marks the final region.]
With the ToPA mechanism, the processor writes packets to the current output region (identified by
proc_trace_table_base and the proc_trace_table_offset). The offset within that region to which the next byte will
be written is identified by proc_trace_output_offset. When that region is filled with packet output (thus
proc_trace_output_offset = RegionSize–1), proc_trace_table_offset is moved to the next ToPA entry,
proc_trace_output_offset is set to 0, and packet writes begin filling the new output region specified by
proc_trace_table_offset.
As packets are written out, each store derives its physical address as follows:
trace_store_phys_addr ← base address from current ToPA table entry + proc_trace_output_offset
Eventually, the regions represented by all entries in the table may become full, and the final entry of the table is
reached. An entry can be identified as the final entry because it has either the END or STOP attribute. The END
attribute indicates that the address in the entry does not point to another output region, but rather to another ToPA
table. The STOP attribute indicates that tracing will be disabled once the corresponding region is filled. See Table
35-3 and the section that follows for details on STOP.
When an END entry is reached, the processor loads proc_trace_table_base with the base address held in this END
entry, thereby moving the current table pointer to this new table. The proc_trace_table_offset is reset to 0, as is
the proc_trace_output_offset, and packet writes will resume at the base address indicated in the first entry.
If the table has no STOP or END entry, and trace-packet generation remains enabled, eventually the maximum
table size will be reached (proc_trace_table_offset = 0FFFFFF8H). In this case, the proc_trace_table_offset and
proc_trace_output_offset are reset to 0 (wrapping back to the beginning of the current table) once the last output
region is filled.
It is important to note that processor updates to the IA32_RTIT_OUTPUT_BASE and
IA32_RTIT_OUTPUT_MASK_PTRS MSRs are asynchronous to instruction execution. Thus, reads of these MSRs
while Intel PT is enabled may return stale values. Like all IA32_RTIT_* MSRs, the values of these MSRs should not
be trusted or saved unless trace packet generation is first disabled by clearing IA32_RTIT_CTL.TraceEn. This
ensures that the output MSR values account for all packets generated to that point, after which the processor will
cease updating the output MSR values until tracing resumes.1
The processor may cache internally any number of entries from the current table or from tables that it references
(directly or indirectly). If tracing is enabled, the processor may ignore or delay detection of modifications to these
tables. To ensure that table changes are detected by the processor in a predictable manner, software should clear
TraceEn before modifying the current table (or tables that it references) and only then re-enable packet genera-
tion.
A ToPA table entry has the following layout: bits 63:MAXPHYADDR are reserved; bits MAXPHYADDR–1:12 hold the
physical base address of the output region (or, for an END entry, of the next table); bits 9:6 hold the Size field;
bit 4 is STOP; bit 2 is INT; bit 0 is END; the remaining bits are reserved.
Table 35-3 describes the details of the ToPA table entry fields. If reserved bits are set to 1, an error is signaled.
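Field extraction from a ToPA entry can be sketched as follows. The bit positions come from the layout above; the 52-bit physical-address mask and the helper names are assumptions of this sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of ToPA entry field extraction: bit 0 = END, bit 2 = INT,
 * bit 4 = STOP, bits 9:6 = Size, base address in bits MAXPHYADDR-1:12
 * (MAXPHYADDR = 52 is assumed here). */
static int topa_end(uint64_t e)  { return (int)(e & 1u); }
static int topa_int(uint64_t e)  { return (int)((e >> 2) & 1u); }
static int topa_stop(uint64_t e) { return (int)((e >> 4) & 1u); }
static uint64_t topa_size(uint64_t e) { return (e >> 6) & 0xFu; }
static uint64_t topa_base(uint64_t e) {
    return e & 0x000FFFFFFFFFF000ull;
}
```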
1. Although WRMSR is a serializing instruction, the execution of WRMSR that forces packet writes by clearing TraceEn does not itself
cause these writes to be globally observed.
ToPA STOP
Each ToPA entry has a STOP bit. If this bit is set, the processor will set the IA32_RTIT_STATUS.Stopped bit when
the corresponding trace output region is filled. This will clear TriggerEn and thereby cease packet generation. See
Section 35.2.7.4 for details on IA32_RTIT_STATUS.Stopped. This sequence is known as “ToPA Stop”.
No TIP.PGD packet will be seen in the output when the ToPA stop occurs, since the disable happens only when the
region is already full. When this occurs, output ceases after the last byte of the region is filled, which may mean
that a packet is cut off in the middle. Any packets remaining in internal buffers are lost and cannot be recovered.
When ToPA stop occurs, the IA32_RTIT_OUTPUT_BASE MSR will hold the base address of the table whose entry
had STOP=1. IA32_RTIT_OUTPUT_MASK_PTRS.MaskOrTableOffset will hold the index value for that entry, and the
IA32_RTIT_OUTPUT_MASK_PTRS.OutputOffset should be set to the size of the region.
Note that this means the offset pointer is pointing to the next byte after the end of the region, a configuration that
would produce an operational error if the configuration remained when tracing is re-enabled with
IA32_RTIT_STATUS.Stopped cleared.
ToPA PMI
Each ToPA entry has an INT bit. If this bit is set, the processor will signal a performance-monitoring interrupt (PMI)
when the corresponding trace output region is filled. This interrupt is not precise, and it is thus likely that writes to
the next region will occur by the time the interrupt is taken.
The following steps should be taken to configure this interrupt:
1. Enable PMI via the LVT Performance Monitor register (at MMIO offset 340H in xAPIC mode; via MSR 834H in
x2APIC mode). See Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B for more
details on this register. For ToPA PMI, set all fields to 0, save for the interrupt vector, which can be selected by
software.
2. Set up an interrupt handler to service the interrupt vector that a ToPA PMI can raise.
3. Set the interrupt flag by executing STI.
4. Set the INT bit in the ToPA entry of interest and enable packet generation, using the ToPA output option. Thus,
TraceEn=ToPA=1 in the IA32_RTIT_CTL MSR.
Once the INT region has been filled with packet output data, the interrupt will be signaled. This PMI can be distin-
guished from others by checking bit 55 (Trace_ToPA_PMI) of the IA32_PERF_GLOBAL_STATUS MSR (MSR 38EH).
Once the ToPA PMI handler has serviced the relevant buffer, writing 1 to bit 55 of the MSR at 390H
(IA32_GLOBAL_STATUS_RESET) clears IA32_PERF_GLOBAL_STATUS.Trace_ToPA_PMI.
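The bit-55 check and the corresponding reset-mask value can be sketched as follows; the MSR values here are stand-ins that would come from hypothetical RDMSR/WRMSR wrappers, which are not shown:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of distinguishing a ToPA PMI: bit 55 (Trace_ToPA_PMI) of
 * IA32_PERF_GLOBAL_STATUS indicates the ToPA source. Writing 1 to the
 * same bit position of the reset MSR at 390H clears the status bit. */
#define TRACE_TOPA_PMI_BIT (1ull << 55)

static int is_topa_pmi(uint64_t perf_global_status) {
    return (perf_global_status & TRACE_TOPA_PMI_BIT) != 0;
}
static uint64_t topa_pmi_reset_mask(void) {
    return TRACE_TOPA_PMI_BIT;  /* value to write to the reset MSR */
}
```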
Intel PT is not frozen on PMI, and thus the interrupt handler will be traced (though filtering can prevent this). The
Freeze_Perfmon_on_PMI and Freeze_LBRs_on_PMI settings in IA32_DEBUGCTL will be applied on ToPA PMI just
as on other PMIs, and hence Perfmon counters will be frozen if Freeze_Perfmon_on_PMI is set.
If the PMI handler wishes to read any buffered packets for persistent output, or wishes to modify any Intel
PT MSRs, software should first disable packet generation by clearing TraceEn. This ensures that all buffered
packets are written to memory and avoids tracing of the PMI handler. The configuration MSRs can then be used to
determine where tracing has stopped. If packet generation is disabled by the handler, it should then be manually
re-enabled before the IRET if continued tracing is desired.
In rare cases, it may be possible to trigger a second ToPA PMI before the first is handled. This can happen if another
ToPA region with INT=1 is filled before, or shortly after, the first PMI is taken, perhaps due to EFLAGS.IF being
cleared for an extended period of time. This can manifest in two ways: either the second PMI is triggered before the
first is taken, and hence only one PMI is taken, or the second is triggered after the first is taken, and thus will be
taken when the handler for the first completes. Software can minimize the likelihood of the second case by clearing
TraceEn at the beginning of the PMI handler. Further, it can detect such cases by then checking the Interrupt
Request Register (IRR) for PMI pending, and checking the ToPA table base and offset pointers (in
IA32_RTIT_OUTPUT_BASE and IA32_RTIT_OUTPUT_MASK_PTRS) to see if multiple entries with INT=1 have been
filled.
When IA32_RTIT_CTL.InjectPsbPmiOnEnable[56] = 1, the PMI handler should take the following actions:
1. Ignore ToPA PMIs that are taken when TraceEn = 0, because the Intel PT MSR state may have already been
saved by XSAVES, and because the PMI will be re-injected when Intel PT is re-enabled.
2. Clear the new IA32_RTIT_STATUS.PendTopaPMI[7] bit once the PMI has been handled. This bit should not be
cleared in cases where a PMI is ignored due to TraceEn = 0.
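The two rules above amount to a simple handler policy. The sketch below is illustrative (function and parameter names are ours; real MSR access is platform-specific and omitted):

```python
# Sketch of the ToPA PMI handler policy when InjectPsbPmiOnEnable = 1.
# trace_en:    current value of IA32_RTIT_CTL.TraceEn.
# rtit_status: current value of IA32_RTIT_STATUS.
# Returns the new IA32_RTIT_STATUS value, or None if the PMI must be ignored.

PEND_TOPA_PMI_BIT = 7  # IA32_RTIT_STATUS.PendTopaPMI

def handle_topa_pmi(trace_en: bool, rtit_status: int):
    if not trace_en:
        # PT MSR state may already have been saved by XSAVES, and the PMI
        # will be re-injected when PT is re-enabled; leave PendTopaPMI set.
        return None
    # ... service the filled output region here ...
    # Clear PendTopaPMI only after the PMI has actually been handled.
    return rtit_status & ~(1 << PEND_TOPA_PMI_BIT)
```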
ToPA Errors
When a malformed ToPA entry is found, an operational error results (see Section 35.3.9). A malformed entry can
be any of the following:
1. ToPA entry reserved bit violation.
This describes cases where a bit marked as reserved in Section 35.2.6.2 above is set to 1.
2. ToPA alignment violation.
This includes cases where illegal ToPA entry base address bits are set to 1:
a. ToPA table base address is not 4KB-aligned. The table base can be from a WRMSR to
IA32_RTIT_OUTPUT_BASE, or from a ToPA entry with END=1.
b. ToPA entry base address is not aligned to the ToPA entry size (e.g., a 2MB region with base address[20:12]
not equal to 0), for ToPA entries with END=0.
c. ToPA entry base address sets upper physical address bits not supported by the processor.
3. Illegal ToPA Output Offset.
IA32_RTIT_OUTPUT_MASK_PTRS.OutputOffset is greater than or equal to the size of the current ToPA output
region size.
4. ToPA rules violations.
These are similar to ToPA entry reserved bit violations; they are cases when a ToPA entry is encountered with
illegal field combinations. They include the following:
a. Setting the STOP or INT bit on an entry with END=1.
b. Setting the END bit in entry 0 of a ToPA table.
c. On processors that support only a single ToPA entry (see above), two additional illegal settings apply:
i) ToPA table entry 1 with END=0.
ii) ToPA table entry 1 with base address not matching the table base.
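Several of the malformed-entry conditions above can be checked mechanically. The sketch below assumes the entry layout of Section 35.2.6.2 (bit 0 = END, bit 2 = INT, bit 4 = STOP, bits 9:6 = Size with region size 4KB << Size, base address from bit 12 up); it covers a subset of the checks and is not a complete validator:

```python
# Sketch: detecting some malformed ToPA entries before handing a table to
# hardware. Field positions are assumptions based on Section 35.2.6.2.

END, INT, STOP = 1 << 0, 1 << 2, 1 << 4
RESERVED = (1 << 1) | (1 << 3) | (1 << 5) | (1 << 10) | (1 << 11)

def check_topa_entry(entry: int, index: int, maxphyaddr: int = 46):
    """Return a list of violation descriptions (empty if none found)."""
    errors = []
    if entry & RESERVED:
        errors.append("reserved bit violation")
    size = (entry >> 6) & 0xF
    base = entry & ~0xFFF
    if entry & END:
        if entry & (STOP | INT):
            errors.append("STOP or INT set on END=1 entry")
        if index == 0:
            errors.append("END set in entry 0")
    else:
        # Output region base must be aligned to the region size.
        if base % (0x1000 << size):
            errors.append("base not aligned to region size")
    if base >> maxphyaddr:
        errors.append("base sets unsupported physical address bits")
    return errors
```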
In all cases, the error will be logged by setting IA32_RTIT_STATUS.Error, thereby disabling tracing when the
problematic ToPA entry is reached (when proc_trace_table_offset points to the entry containing the error). Any packet
bytes that are internally buffered when the error is detected may be lost.
Note that operational errors may also be signaled due to attempts to access restricted memory. See Section
35.2.6.4 for details.
Tracing software has considerable flexibility in using ToPA to manage the interaction of Intel PT with application
buffers; see Section 35.4.2.26.
It should also be noted that packet output should not be routed to the 4KB APIC MMIO region, as defined by the
IA32_APIC_BASE MSR. For details about the APIC, refer to Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 3A. No error is signaled for this case.
Note: Software may write the same value back to IA32_RTIT_CTL without #GP, even if TraceEn=1.
• All configuration MSRs for Intel PT are duplicated per logical processor.
• For each configuration MSR, any MSR write that attempts to change bits marked reserved, or utilize encodings
marked reserved, will cause a #GP fault.
• All configuration MSRs for Intel PT are cleared on a cold RESET.
— If CPUID.(EAX=14H, ECX=0):EBX.IPFILT_WRSTPRSV[bit 2] = 1, only the TraceEn bit is cleared on warm
RESET; though this may have the impact of clearing other bits in IA32_RTIT_STATUS. Other MSR values of
the trace configuration MSRs are preserved on warm RESET.
• The semantics of MSR writes to trace configuration MSRs in this chapter generally apply to explicit WRMSR to
these registers, using VM-exit or VM-entry MSR load list to these MSRs, XRSTORS with requested feature bit
map including XSAVE map component of state_8 (corresponding to IA32_XSS[bit 8]), and the write to
IA32_RTIT_CTL.TraceEn by XSAVES (Section 35.3.5.2).
The CurrentIP listed above is the IP of the associated instruction. The TargetIP is the IP of the next instruction to
be executed; for HLE, this is the XACQUIRE lock; for RTM, this is the fallback handler.
Intel PT stores are non-transactional, and thus packet writes are not rolled back on TSX abort.
During the enclave execution, Intel PT remains enabled, and periodic or timing packets such as PSB, TSC, MTC, or
CBR can still be generated. No IPs or other architectural state will be exposed.
For packet generation examples on enclave entry or exit, see Section 35.7.
Debug Enclaves
Intel SGX allows an enclave to be configured with relaxed protection of confidentiality for debug purposes; see the
Intel® Software Guard Extensions Programming Reference. In a debug enclave, Intel PT continues to function
normally. Specifically, ContextEn is not impacted by an enclave entry or exit. Hence, the generation of ContextEn-
dependent packets within a debug enclave is allowed.
Table 35-11. CPUID Leaf 14H Enumeration of Intel Processor Trace Capabilities
CPUID.(EAX=14H,ECX=0) Name Description Behavior
Register Bits
EAX 31:0 Maximum valid sub-leaf Index Specifies the index of the maximum valid sub-leaf for this CPUID leaf
0 CR3 Filtering Support 1: Indicates that IA32_RTIT_CTL.CR3Filter can be set to 1, and that
IA32_RTIT_CR3_MATCH MSR can be accessed. See Section 35.2.7.
EBX
0: Indicates that writes that set IA32_RTIT_CTL.CR3Filter to 1, or any
access to IA32_RTIT_CR3_MATCH, will #GP fault.
1 Configurable PSB and Cycle- 1: (a) IA32_RTIT_CTL.PSBFreq can be set to a non-zero value, in order to
Accurate Mode Supported select the preferred PSB frequency (see below for allowed values). (b)
IA32_RTIT_STATUS.PacketByteCnt can be set to a non-zero value, and
will be incremented by the processor when tracing to indicate progress
towards the next PSB. If trace packet generation is enabled by setting
TraceEn, a PSB will only be generated if PacketByteCnt=0. (c)
IA32_RTIT_CTL.CYCEn can be set to 1 to enable Cycle-Accurate Mode.
See Section 35.2.7.
0: (a) Any attempt to set IA32_RTIT_CTL.PSBFreq, or to write a non-zero
value to IA32_RTIT_STATUS.PacketByteCnt, will #GP fault. (b) If trace
packet generation is enabled by setting TraceEn, a PSB is always
generated. (c) Any attempt to set IA32_RTIT_CTL.CYCEn will #GP fault.
2 IP Filtering and TraceStop 1: (a) IA32_RTIT_CTL provides one or more ADDRn_CFG fields to
supported, and Preserve Intel configure the corresponding address range MSRs for IP Filtering or IP
PT MSRs across warm reset TraceStop. Each ADDRn_CFG field accepts a value in the range of 0:2
inclusive. The number of ADDRn_CFG fields is reported by
CPUID.(EAX=14H, ECX=1):EAX.RANGECNT[2:0]. (b) At least one register
pair IA32_RTIT_ADDRn_A and IA32_RTIT_ADDRn_B is provided to
configure address ranges for IP filtering or IP TraceStop. (c) On warm
reset, all Intel PT MSRs will retain their pre-reset values, though
IA32_RTIT_CTL.TraceEn will be cleared. The Intel PT MSRs are listed in
Section 35.2.7.
0: (a) An attempt to write IA32_RTIT_CTL.ADDRn_CFG with non-zero
encoding values will cause #GP. (b) Any access to IA32_RTIT_ADDRn_A
or IA32_RTIT_ADDRn_B will #GP fault. (c) On warm reset, all Intel PT
MSRs will be cleared.
3 MTC Supported 1: IA32_RTIT_CTL.MTCEn can be set to 1, and MTC packets will be
generated. See Section 35.2.7.
0: An attempt to set IA32_RTIT_CTL.MTCEn or IA32_RTIT_CTL.MTCFreq
to a non-zero value will #GP fault.
4 PTWRITE Supported 1: Writes can set IA32_RTIT_CTL[12] (PTWEn) and IA32_RTIT_CTL[5]
(FUPonPTW), and PTWRITE can generate packets.
0: Writes that set IA32_RTIT_CTL[12] or IA32_RTIT_CTL[5] will #GP,
and PTWRITE will #UD fault.
5 Power Event Trace Supported 1: Writes can set IA32_RTIT_CTL[4] (PwrEvtEn), enabling Power Event
Trace packet generation.
0: Writes that set IA32_RTIT_CTL[4] will #GP.
31:6 Reserved
Table 35-11. CPUID Leaf 14H Enumeration of Intel Processor Trace Capabilities (Contd.)
CPUID.(EAX=14H,ECX=0) Name Description Behavior
Register Bits
0 ToPA Output Supported 1: Tracing can be enabled with IA32_RTIT_CTL.ToPA = 1, hence utilizing
the ToPA output scheme (Section 35.2.6.2). The IA32_RTIT_OUTPUT_BASE
and IA32_RTIT_OUTPUT_MASK_PTRS MSRs can be accessed.
ECX
0: Unless CPUID.(EAX=14H, ECX=0):ECX.SNGLRNGOUT[bit 2] = 1, writes
to the IA32_RTIT_OUTPUT_BASE or IA32_RTIT_OUTPUT_MASK_PTRS
MSRs will #GP fault.
1 ToPA Tables Allow Multiple 1: ToPA tables can hold any number of output entries, up to the
Output Entries maximum allowed by the MaskOrTableOffset field of
IA32_RTIT_OUTPUT_MASK_PTRS.
0: ToPA tables can hold only one output entry, which must be followed
by an END=1 entry which points back to the base of the table.
Further, ToPA PMIs will be delivered before the region is filled. See ToPA
PMI in Section 35.2.6.2.
If there is more than one output entry before the END entry, or if the
END entry has the wrong base address, an operational error will be
signaled (see “ToPA Errors” in Section 35.2.6.2).
2 Single-Range Output 1: Enabling tracing (TraceEn=1) with IA32_RTIT_CTL.ToPA=0 is
Supported supported.
0: Unless CPUID.(EAX=14H, ECX=0):ECX.TOPAOUT[bit 0] = 1, writes to
the IA32_RTIT_OUTPUT_BASE or IA32_RTIT_OUTPUT_MASK_PTRS MSRs
will #GP fault.
3 Output to Trace Transport 1: Setting IA32_RTIT_CTL.FabricEn to 1 is supported.
Subsystem Supported 0: IA32_RTIT_CTL.FabricEn is reserved. Writes that set
IA32_RTIT_CTL.FabricEn will #GP fault.
30:4 Reserved
31 IP Payloads are LIP 1: Generated packets which contain IP payloads have LIP values, which
include the CS base component.
0: Generated packets which contain IP payloads have RIP values, which
are the offset from CS base.
EDX 31:0 Reserved
If CPUID.(EAX=14H, ECX=0):EAX reports a non-zero value, additional capabilities of Intel Processor Trace are
described in the sub-leaves of CPUID leaf 14H.
Table 35-12. CPUID Leaf 14H, sub-leaf 1H Enumeration of Intel Processor Trace Capabilities
CPUID.(EAX=14H,ECX=1) Name Description Behavior
Register Bits
EAX 2:0 Number of Address Ranges A non-zero value specifies the number of ADDRn_CFG fields supported
in IA32_RTIT_CTL and the number of register pairs
IA32_RTIT_ADDRn_A/IA32_RTIT_ADDRn_B supported for IP filtering
and IP TraceStop.
NOTE: Currently, no processors support more than 4 address ranges.
15:3 Reserved
31:16 Bitmap of supported MTC The non-zero bits indicate the map of supported encoding values for
Period Encodings the IA32_RTIT_CTL.MTCFreq field. This applies only if
CPUID.(EAX=14H, ECX=0):EBX.MTC[bit 3] = 1 (MTC Packet generation is
supported), otherwise the MTCFreq field is reserved to 0.
Each bit position in this field represents 1 encoding value in the 4-bit
MTCFreq field (i.e., bit 0 is associated with encoding value 0). For each
bit:
1: MTCFreq can be assigned the associated encoding value.
0: MTCFreq cannot be assigned the associated encoding value. A
write to IA32_RTIT_CTL.MTCFreq with an unsupported encoding will
cause a #GP fault.
EBX 15:0 Bitmap of supported Cycle The non-zero bits indicate the map of supported encoding values for
Threshold values the IA32_RTIT_CTL.CycThresh field. This applies only if
CPUID.(EAX=14H, ECX=0):EBX.CPSB_CAM[bit 1] = 1 (Cycle-Accurate
Mode is Supported), otherwise the CycThresh field is reserved to 0. See
Section 35.2.7.
Each bit position in this field represents 1 encoding value in the 4-bit
CycThresh field (i.e., bit 0 is associated with encoding value 0). For each
bit:
1: CycThresh can be assigned the associated encoding value.
0: CycThresh cannot be assigned the associated encoding value. A
write to CycThresh with an unsupported encoding will cause a #GP fault.
31:16 Bitmap of supported The non-zero bits indicate the map of supported encoding values for
Configurable PSB Frequency the IA32_RTIT_CTL.PSBFreq field. This applies only if
encoding CPUID.(EAX=14H, ECX=0):EBX.CPSB_CAM[bit 1] = 1 (Configurable PSB
is supported), otherwise the PSBFreq field is reserved to 0. See
Section 35.2.7.
Each bit position in this field represents 1 encoding value in the 4-bit
PSBFreq field (i.e., bit 0 is associated with encoding value 0). For each
bit:
1: PSBFreq can be assigned the associated encoding value.
0: PSBFreq cannot be assigned the associated encoding value. A
write to PSBFreq with an unsupported encoding will cause a #GP fault.
ECX 31:0 Reserved
EDX 31:0 Reserved
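The bitmap convention used by these sub-leaf fields (bit n set means encoding value n is writable) can be expressed directly. The sketch below is illustrative; the function names are ours, and the caller is assumed to have already executed CPUID with EAX=14H, ECX=1:

```python
# Sketch: interpreting the supported-encodings bitmaps of CPUID.(EAX=14H, ECX=1).
# EAX[31:16] maps MTCFreq encodings; EBX[15:0] maps CycThresh encodings;
# EBX[31:16] maps PSBFreq encodings. Bit n set => encoding value n is allowed.

def supported_encodings(reg: int, lo_bit: int, width: int = 16):
    """Return the set of allowed encoding values described by a bitmap field."""
    field = (reg >> lo_bit) & ((1 << width) - 1)
    return {n for n in range(width) if (field >> n) & 1}

def mtc_freq_allowed(cpuid_14h_1_eax: int, encoding: int) -> bool:
    # A write of an unsupported MTCFreq encoding to IA32_RTIT_CTL causes #GP.
    return encoding in supported_encodings(cpuid_14h_1_eax, 16)
```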
When restoring the trace configuration context, IA32_RTIT_CTL should be restored last:
1. Read saved configuration MSR values, aside from IA32_RTIT_CTL, from memory, and restore them with
WRMSR.
2. Read the saved IA32_RTIT_CTL value from memory, and restore it with WRMSR.
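The ordering requirement above can be sketched as follows. This is illustrative only: wrmsr stands in for a platform-specific MSR write primitive, and the non-CTL addresses in the example are just sample entries in the saved-state map:

```python
# Sketch: restoring Intel PT configuration MSRs, writing IA32_RTIT_CTL last
# so that tracing is only re-enabled once the rest of the state is in place.

IA32_RTIT_CTL = 0x570  # architectural MSR address of IA32_RTIT_CTL

def restore_pt_state(saved: dict, wrmsr):
    """saved maps MSR address -> saved value; wrmsr(addr, value) does the write."""
    for addr, value in saved.items():
        if addr != IA32_RTIT_CTL:
            wrmsr(addr, value)
    # IA32_RTIT_CTL (possibly with TraceEn=1) goes last.
    if IA32_RTIT_CTL in saved:
        wrmsr(IA32_RTIT_CTL, saved[IA32_RTIT_CTL])
```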
The IA32_XSS MSR is zero coming out of RESET. Once IA32_XSS[bit 8] is set, system software operating at CPL=
0 can use XSAVES/XRSTORS with the appropriate requested-feature bitmap (RFBM) to manage supervisor state
components in the XSAVE map. See Chapter 13, “Managing State Using the XSAVE Feature Set” of Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 1.
1. Table 35-13 documents support for the MSRs defining address ranges 0 and 1. Processors that provide XSAVE support for Intel Processor
Trace support only those address ranges.
normal binding and ordering rules that apply to these packets outside of PSB+ can be ignored when these packets
are between a PSB and PSBEND. They inform the decoder of the state of the processor at the time of the PSB.
PSB+ can include:
• Timestamp (TSC), if IA32_RTIT_CTL.TSCEn=1.
• Timestamp-MTC Align (TMA), if IA32_RTIT_CTL.TSCEn=1 && IA32_RTIT_CTL.MTCEn=1.
• Paging Information Packet (PIP), if ContextEn=1 and IA32_RTIT_CTL.OS=1. The non-root bit (NR) is set if the
logical processor is in VMX non-root operation and the “conceal VMX from PT” VM-execution control is 0.
• VMCS packet, if either the logical processor is in VMX root operation or the logical processor is in VMX non-root
operation and the “conceal VMX from PT” VM-execution control is 0.
• Core Bus Ratio (CBR).
• MODE.TSX, if ContextEn=1 and BranchEn = 1.
• MODE.Exec, if PacketEn=1.
• Flow Update Packet (FUP), if PacketEn=1.
PSB is generated only when TriggerEn=1; hence PSB+ has the same dependencies. The ordering of packets within
PSB+ is not fixed. Timing packets such as CYC and MTC may be generated between PSB and PSBEND, and their
meanings are the same as outside PSB+.
A PSB+ can be lost in some scenarios. If IA32_RTIT_STATUS.TriggerEn is cleared just as the PSB threshold is
reached, the PSB+ may not be generated. TriggerEn can be cleared by a WRMSR that clears
IA32_RTIT_CTL.TraceEn, a VM-exit that clears IA32_RTIT_CTL.TraceEn, an #SMI, or any time that either
IA32_RTIT_STATUS.Stopped is set (e.g., by a TraceStop or ToPA stop condition) or IA32_RTIT_STATUS.Error is set
(e.g., by an Intel PT output error). On processors that support PSB preservation (CPUID.(EAX=14H,
ECX=0):EBX.INJECTPSBPMI[6] = 1), setting IA32_RTIT_CTL.InjectPsbPmiOnEnable[56] = 1 will ensure that a
PSB+ that is pending at the time PT is disabled will be recorded by setting IA32_RTIT_STATUS.PendPSB[6] = 1. A
PSB will be inserted, and PendPSB cleared, when PT is later re-enabled while PendPSB = 1.
Note that an overflow can occur during PSB+, and this could cause the PSBEND packet to be lost. For this reason,
the OVF packet should also be viewed as terminating PSB+.
If a TraceStop event occurs during the buffer overflow, IA32_RTIT_STATUS.Stopped will still be set, and tracing will
cease as a result. However, the TraceStop packet, and any TIP.PGD that results from the TraceStop, may be
dropped.
indicated by these intermediate packets should be applied at the IP of the TIP* packet. A summary of compound
packet events is provided in Table 35-15; see Section 35.4.1.1 for more per-packet details and Section 35.7 for
more detailed packet generation examples.
It is possible to encounter an internal buffer overflow in the middle of a block. In such a case, it is guaranteed that
packet generation will not resume in the middle of a block, and hence the OVF packet terminates the current block.
Depending on the duration of the overflow, subsequent blocks may also be lost.
Decoder Implications
When a Block Begin Packet (BBP) is encountered, the decoder will need to decode some packets within the block
differently from those outside a block. The Block Item Packet (BIP) header byte has the same encoding as a TNT
packet outside of a block, but must be treated as a BIP header (with following payload) within one.
When an OVF packet is encountered, the decoder should treat that as a block ending condition. Packet generation
will not resume within a block.
Packet format diagrams in this section show the byte number in the left column and the bit number (7:0) across the
top row; in the original figures, header bits are shaded green and payload bits are white. For example:
Byte    7 6 5 4 3 2 1 0
0       0 1 0 1 0 1 0 1
1       1 1 0 0 0 1 1 0
2       0 1 0 0 0 1 1 0
Description of fields:
Dependencies: Depends on packet generation configuration enable controls or other bits (Section 35.2.5).
Generation Scenario: Which instructions, events, or other scenarios can cause this packet to be generated.
Description: Description of the packet, including the purpose it serves, meaning of the information or payload, etc.
Application: How a decoder should apply this packet. It may bind to a specific instruction from the binary, or to
another packet in the stream, or have other implications on decode.
B1…BN represent the last N conditional branch or compressed RET (Section 35.4.2.2) results, such that B1 is oldest
and BN is youngest. The short TNT packet can contain from 1 to 6 TNT bits. The long TNT packet can contain from
1 to 47 TNT bits.
7 6 5 4 3 2 1 0
0 0 0 0 0 0 0 1 0 Long TNT
1 1 0 1 0 0 0 1 1
2 B40 B41 B42 B43 B44 B45 B46 B47
3 B32 B33 B34 B35 B36 B37 B38 B39
4 B24 B25 B26 B27 B28 B29 B30 B31
5 B16 B17 B18 B19 B20 B21 B22 B23
6 B8 B9 B10 B11 B12 B13 B14 B15
7 1 B1 B2 B3 B4 B5 B6 B7
Irrespective of how many TNT bits are in a packet, the last valid TNT bit is followed by a trailing 1, or Stop bit, as
shown above. If the TNT packet is not full (fewer than 6 TNT bits for the Short TNT, or fewer than 47 TNT bits for
the Long TNT), the Stop bit moves up, and the trailing bits of the packet are filled with 0s. Examples of these
“partial TNTs” are shown below.
7 6 5 4 3 2 1 0
0 0 0 1 B1 B2 B3 B4 0 Short TNT
7 6 5 4 3 2 1 0
0 0 0 0 0 0 0 1 0 Long TNT
1 1 0 1 0 0 0 1 1
2 B24 B25 B26 B27 B28 B29 B30 B31
3 B16 B17 B18 B19 B20 B21 B22 B23
4 B8 B9 B10 B11 B12 B13 B14 B15
5 1 B1 B2 B3 B4 B5 B6 B7
6 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
Dependencies: PacketEn.
Generation Scenario: On a conditional branch or compressed RET, if it fills the TNT. Also, partial TNTs may be
generated at any time, as a result of other packets being generated, or certain micro-architectural conditions
occurring, before the TNT is full.
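The Stop-bit convention described above can be decoded mechanically. The following sketch handles the short TNT byte only (the long TNT payload follows the same idea over more bytes); the function name is ours:

```python
# Sketch: extracting taken/not-taken bits from a short TNT packet byte.
# Bit 0 is the packet header bit (0 for a short TNT); the highest set bit is
# the Stop bit; the bits between them are B1 (oldest, just below the Stop
# bit) down through BN (youngest, at bit 1).

def decode_short_tnt(byte: int):
    assert byte & 1 == 0, "bit 0 must be 0 for a short TNT"
    stop = byte.bit_length() - 1          # position of the Stop bit
    assert 2 <= stop <= 7, "empty or malformed TNT"
    # Oldest branch result first; True means taken.
    return [bool((byte >> pos) & 1) for pos in range(stop - 1, 0, -1)]
```

For example, the 4-bit partial TNT shown above (Stop bit at bit 5) with B1..B4 = taken, not-taken, taken, taken decodes from the byte 0b00110110.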
Dependencies: PacketEn.
Generation Scenario: Indirect branch (including uncompressed RET), far branch, interrupt, exception, INIT, SIPI,
VM exit, VM entry, TSX abort, EENTER, EEXIT, ERESUME, AEX1.
Description: Provides the target for some control flow transfers.
Application: Anytime a TIP is encountered, it indicates that control was transferred to the IP provided in the payload.
The source of this control flow change, and hence the IP or instruction to which it binds, depends on the packets
that precede the TIP. If a TIP is encountered and all preceding packets have already been bound, then the TIP will
apply to the upcoming indirect branch, far branch, or VMRESUME. However, if there was a preceding FUP that
remains unbound, it will bind to the TIP. Here, the TIP provides the target of an asynchronous event or TSX abort
that occurred at the IP given in the FUP payload. Note that there may be other packets, in addition to the FUP, which
will bind to the TIP packet. See the packet application descriptions for other packets for details.
NOTES:
1. EENTER, EEXIT, ERESUME, AEX would be possible only for a debug enclave.
IP Compression
The IP payload in a TIP, FUP, TIP.PGE, or TIP.PGD packet can vary in size, based on the mode of execution, and the
use of IP compression. IP compression is an optional compression technique the processor may choose to employ
to reduce bandwidth. With IP compression, the IP to be represented in the payload is compared with the last IP
sent out, via any of FUP, TIP, TIP.PGE, or TIP.PGD. If that previous IP had the same upper (most significant) address
bytes, those matching bytes may be suppressed in the current packet. The processor maintains an internal state
of the “Last IP” that was encoded in trace packets, thus the decoder will need to keep track of the “Last IP” state in
software, to match fidelity with packets generated by hardware. “Last IP” is initialized to zero, hence the first IP
in the trace may be compressed if the upper bytes are zeroes.
The “IPBytes” field of the IP packets (FUP, TIP, TIP.PGE, TIP.PGD) serves to indicate how many bytes of payload are
provided, and how the decoder should fill in any suppressed bytes. The algorithm for reconstructing the IP for a
TIP/FUP packet is shown in the table below.
The processor-internal Last IP state is guaranteed to be reset to zero when a PSB is sent out. This means that the
IP that follows the PSB will either be un-compressed (011b or 110b, see Table 35-19), or compressed against
zero.
At times, “IPbytes” will have a value of 0. As shown above, this does not mean that the IP payload matches the full
address of the last IP, but rather that the IP for this packet was suppressed. This is used for cases where the IP that
applies to the packet is out of context. An example is the TIP.PGD sent on a SYSCALL, when tracing only USR code.
In that case, no TargetIP will be included in the packet, since that would expose an instruction pointer at CPL = 0.
When the IP payload is suppressed in this manner, Last IP is not cleared, and instead refers to the last IP packet
with a non-zero IPBytes field.
On processors that support a maximum linear address size of 32 bits, IP payloads may never exceed 32 bits
(IPBytes <= 010b).
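A decoder's IP reconstruction can be sketched as below. The IPBytes encodings used here are based on Table 35-19 (000b suppressed, 001b 2 payload bytes, 010b 4 bytes, 011b 6 bytes sign-extended, 100b 6 bytes with upper 16 bits from Last IP, 110b full 8-byte IP; 101b and 111b reserved); treat the exact encoding table as authoritative:

```python
# Sketch: reconstructing an IP from the "IPBytes" field, the payload, and the
# decoder-maintained "Last IP" state, per the compression scheme above.

def reconstruct_ip(ip_bytes: int, payload: int, last_ip: int):
    """Return (ip, new_last_ip); ip is None when the payload is suppressed."""
    if ip_bytes == 0b000:
        return None, last_ip              # suppressed; Last IP unchanged
    if ip_bytes == 0b001:                 # IP[15:0] in payload, rest from Last IP
        ip = (last_ip & ~0xFFFF) | (payload & 0xFFFF)
    elif ip_bytes == 0b010:               # IP[31:0] in payload
        ip = (last_ip & ~0xFFFFFFFF) | (payload & 0xFFFFFFFF)
    elif ip_bytes == 0b011:               # IP[47:0] in payload, sign-extended
        ip = payload & 0xFFFFFFFFFFFF
        if ip & (1 << 47):
            ip |= 0xFFFF << 48
    elif ip_bytes == 0b100:               # IP[47:0] in payload, rest from Last IP
        ip = (last_ip & ~0xFFFFFFFFFFFF) | (payload & 0xFFFFFFFFFFFF)
    elif ip_bytes == 0b110:               # full 8-byte IP
        ip = payload
    else:
        raise ValueError("reserved IPBytes encoding")
    return ip, ip                         # Last IP tracks the reconstructed IP
```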
packet. This can happen, for example, if software executes a PUSH instruction to push a target onto the stack, and
a later RET uses this target.
When a RET is compressed, a Taken indication is added to the TNT buffer. Because it sends no TIP packet, it also
does not update the internal Last IP value, and thus the decoder should treat it the same way. If the RET is not
compressed, it will generate a TIP packet (just like when RET compression is disabled, via
IA32_RTIT_CTL.DisRETC). For processors that employ deferred TIPs (Section 35.4.2.3), an uncompressed RET will
not be deferred, and hence will force out any accumulated TNTs or TIPs. This serves to avoid ambiguity, and make
clear to the decoder whether the near RET was compressed, and hence a bit in the in-progress TNT should be
consumed, or uncompressed, in which case there will be no in-progress TNT and thus a TIP should be consumed.
Note that in the unlikely case that a RET executes in a different execution mode than the associated CALL, the
decoder will need to model the same behavior with its CALL stack. For instance, if a CALL executes in 64-bit mode,
a 64-bit IP value will be pushed onto the software stack. If the corresponding RET executes in 32-bit mode, then
only the lower 32 target bits will be popped off of the stack, which may mean that the RET does not go to the CALL’s
Next IP. This is architecturally correct behavior, and this RET could be compressed, thus the decoder should match
this behavior.
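The decoder-side CALL stack described above can be modeled as follows. This is a simplified sketch (class and method names are ours) that captures only the push/pop truncation behavior relevant to compressed RETs:

```python
# Sketch: the decoder's CALL stack used to resolve compressed RETs.
# On a near CALL, push the Next IP truncated to the current operand size;
# a compressed RET (a "taken" TNT bit) pops the predicted return target,
# again truncated to the RET's operand size, which models the mode-mismatch
# case described above.

class CallStack:
    def __init__(self):
        self._stack = []

    def on_call(self, next_ip: int, mode_bits: int = 64):
        # A 32-bit mode CALL pushes only IP[31:0].
        self._stack.append(next_ip & ((1 << mode_bits) - 1))

    def on_compressed_ret(self, mode_bits: int = 64) -> int:
        # A 32-bit mode RET pops only 32 target bits, which may differ from
        # the CALL's Next IP if the execution modes mismatch.
        return self._stack.pop() & ((1 << mode_bits) - 1)
```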
Dependencies: PacketEn transitions to 1.
Generation Scenario: Any branch instruction, control flow transfer, or MOV CR3 that sets PacketEn; a WRMSR that
enables packet generation and sets PacketEn.
Description: Indicates that PacketEn has transitioned to 1. It provides the IP at which the tracing begins.
This can occur due to any of the enables that comprise PacketEn transitioning from 0 to 1, as long as all the others
are asserted. Examples:
• TriggerEn: This is set on software write to set IA32_RTIT_CTL.TraceEn as long as the Stopped and Error bits in
IA32_RTIT_STATUS are clear. The IP payload will be the Next IP of the WRMSR.
• FilterEn: This is set when software jumps into the tracing region. This region is defined by enabling IP filtering in
IA32_RTIT_CTL.ADDRn_CFG, and defining the range in IA32_RTIT_ADDRn_[AB]; see Section 35.2.4.3. The
IP payload will be the target of the branch.
• ContextEn: This is set on a CPL change, a CR3 write or any other means of changing ContextEn. The IP payload
will be the Next IP of the instruction that changes context if it is not a branch, otherwise it will be the target of
the branch.
Application: TIP.PGE packets bind to the instruction at the IP given in the payload.
Dependencies: PacketEn transitions to 0.
Generation Scenario: Any branch instruction, control flow transfer, or MOV CR3 that clears PacketEn; a WRMSR
that disables packet generation and clears PacketEn.
Description: Indicates that PacketEn has transitioned to 0. It will include the IP at which the tracing ends, unless
ContextEn=0 or TraceEn=0 at the conclusion of the instruction or event that cleared PacketEn.
PacketEn can be cleared due to any of the enables that comprise PacketEn transitioning from 1 to 0. Examples:
• TriggerEn: This is cleared on software write to clear IA32_RTIT_CTL.TraceEn, or when
IA32_RTIT_STATUS.Stopped is set, or on operational error. The IP payload will be suppressed in this case, and the
“IPBytes” field will have the value 0.
• FilterEn: This is cleared when software jumps out of the tracing region. This region is defined by enabling IP
filtering in IA32_RTIT_CTL.ADDRn_CFG, and defining the range in IA32_RTIT_ADDRn_[AB]; see Section 35.2.4.3.
The IP payload will depend on the type of the branch. For conditional branches, the payload is suppressed
(IPBytes = 0), and in this case the destination can be inferred from the disassembly. For any other type of branch,
the IP payload will be the target of the branch.
• ContextEn: This can happen on a CPL change, a CR3 write or any other means of changing ContextEn. See
Section 35.2.4.3 for details. In this case, when ContextEn is cleared, there will be no IP payload. The “IPBytes”
field will have value 0.
Note that, in cases where a branch that would normally produce a TIP packet (i.e., far transfer, indirect branch,
interrupt, etc.) or TNT update (conditional branch or compressed RET) causes PacketEn to transition from 1 to 0, the
TIP or TNT bit will be replaced with TIP.PGD. The payload of the TIP.PGD will be the target of the branch, unless the
result of the instruction causes TraceEn or ContextEn to be cleared (i.e., SYSCALL when IA32_RTIT_CTL.OS=0). In
the case where a conditional branch clears FilterEn and hence PacketEn, there will be no TNT bit for this branch,
replaced instead by the TIP.PGD.
Application: TIP.PGD can be produced by any branch instructions, as well as some non-branch instructions, that clear PacketEn.
When produced by a branch, it replaces any TIP or TNT update that the branch would normally produce.
In cases where there is an unbound FUP preceding the TIP.PGD, then the TIP.PGD is part of a compound operation (i.e.,
asynchronous event or TSX abort) which cleared PacketEn. For most such cases, the TIP.PGD is simply replacing a
TIP, and should be treated the same way. The TIP.PGD may or may not have an IP payload, depending on whether
the operation cleared ContextEn.
If there is not an associated FUP, the binding will depend on whether there is an IP payload. If there is an IP payload,
then the TIP.PGD should be applied to either the next direct branch whose target matches the TIP.PGD payload, or
the next branch that would normally generate a TIP or TNT packet. If there is no IP payload, then the TIP.PGD should
apply to the next branch or MOV CR3 instruction.
7 6 5 4 3 2 1 0
0 IPBytes 1 1 1 0 1
1 IP[7:0]
2 IP[15:8]
3 IP[23:16]
4 IP[31:24]
5 IP[39:32]
6 IP[47:40]
7 IP[55:48]
8 IP[63:56]
Dependencies: TriggerEn & ContextEn. (Typically depends on BranchEn and FilterEn as well; see Section 35.2.4 for
details.)
Generation Scenario: Asynchronous events (interrupts, exceptions, INIT, SIPI, SMI, VM exit, #MC), XBEGIN, XEND,
XABORT, XACQUIRE, XRELEASE, EENTER, EEXIT, ERESUME, EEE, AEX1, INTO, INT1, INT3, INT n, a WRMSR that
disables packet generation.
Description: Provides the source address for asynchronous events, and some other instructions. Is never sent alone;
always sent with an associated TIP or MODE packet, and potentially others.
Application: FUP packets provide the IP to which they bind. However, they are never standalone, but are coupled
with other packets.
In TSX cases, the FUP is immediately preceded by a MODE.TSX, which binds to the same IP. A TIP will follow only in
the case of TSX aborts, see Section 35.4.2.8 for details.
Otherwise, FUPs are part of compound packet events (see Section 35.4.1). In these compound cases, the FUP pro-
vides the source IP for an instruction or event, while a following TIP (or TIP.PGD) packet will provide the destination
IP. Other packets may be included in the compound event between the FUP and TIP.
NOTES:
1. EENTER, EEXIT, ERESUME, EEE, AEX apply only if Intel Software Guard Extensions is supported.
FUP IP Payload
The Flow Update Packet (FUP) gives the source address of an instruction when it is needed. In general, branch
instructions do not need a FUP, because the source address is clear from the disassembly. For asynchronous events,
however, the source address cannot be inferred from the disassembly, and hence a FUP will be sent. Table 35-24
illustrates cases where FUPs are sent, and which IP can be expected in those cases.
On a canonical fault due to sequentially fetching an instruction in non-canonical space (as opposed to jumping to
non-canonical space), the IP of the fault (and thus the payload of the FUP) will be a non-canonical address. This is
consistent with what is pushed on the stack for such faulting cases.
If there are post-commit task switch faults, the IP value of the FUP will be the original IP from when the task switch
started. This is the same value as would be seen in the LBR_FROM field, but it differs from the value saved on the
stack or in the VMCS.
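FUP and TIP payloads use the chapter's IP-compression scheme: the IPBytes field selects how many payload bytes are present and how the upper address bits are reconstructed from the decoder's tracked last IP. The sketch below assumes the IPBytes encodings defined in the IP-compression description elsewhere in this chapter (not reproduced in this excerpt); treat those encodings, and all names, as assumptions:

```python
MASK64 = (1 << 64) - 1

def reconstruct_ip(ip_bytes: int, payload: int, last_ip: int) -> int:
    """Rebuild a full 64-bit IP from a compressed FUP/TIP payload.

    ip_bytes: the 3-bit IPBytes field from byte 0 of the packet.
    payload:  the little-endian payload bytes, assembled into an integer.
    last_ip:  the decoder's tracked last IP.
    """
    if ip_bytes == 0:                    # IP suppressed; last IP unchanged
        return last_ip
    if ip_bytes == 1:                    # payload carries IP[15:0]
        return (last_ip & ~0xFFFF & MASK64) | (payload & 0xFFFF)
    if ip_bytes == 2:                    # payload carries IP[31:0]
        return (last_ip & ~0xFFFFFFFF & MASK64) | (payload & 0xFFFFFFFF)
    if ip_bytes == 3:                    # payload carries IP[47:0], sign-extended
        low = payload & ((1 << 48) - 1)
        return low | (0xFFFF << 48) if low & (1 << 47) else low
    if ip_bytes == 4:                    # payload carries IP[47:0], rest from last IP
        return (last_ip & ~((1 << 48) - 1) & MASK64) | (payload & ((1 << 48) - 1))
    if ip_bytes == 6:                    # full 64-bit IP in payload
        return payload & MASK64
    raise ValueError("reserved IPBytes encoding")
```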
Dependencies: TriggerEn && ContextEn && IA32_RTIT_CTL.OS
Generation Scenario: MOV CR3, task switch, INIT, SIPI, PSB+, VM exit, VM entry
Description The CR3 payload shown includes only the address portion of the CR3 value. For PAE paging, CR3[11:5] are thus
included. For other paging modes (32-bit and 4-level paging1), these bits are 0.
This packet holds the CR3 address value. It will be generated on operations that modify CR3:
• MOV CR3 operation
• Task Switch
• INIT and SIPI
• VM exit, if “conceal VMX from PT” VM-exit control is 0 (see Section 35.5.1)
• VM entry, if “conceal VMX from PT” VM-entry control is 0
PIPs are not generated, despite changes to CR3, on SMI and RSM. This is due to the special behavior on these oper-
ations, see Section 35.2.8.3 for details. Note that, for some cases of task switch where CR3 is not modified, no PIP
will be produced.
The purpose of the PIP is to indicate to the decoder which application is running, so that it can apply the proper
binaries to the linear addresses that are being traced.
The PIP packet contains the new CR3 value when CR3 is written.
PIPs generated by VM entries set the NR bit. PIPs generated in VMX non-root operation set the NR bit if the “con-
ceal VMX from PT” VM-execution control is 0 (see Section 35.5.1). All other PIPs clear the NR bit.
Application The purpose of the PIP packet is to help the decoder uniquely identify what software is running at any given time.
When a PIP is encountered, a decoder should do the following:
1) If there was a prior unbound FUP (that is, a FUP not preceded by a packet, such as MODE.TSX, that consumes it,
and hence one that pairs with a TIP that has not yet been seen), then this PIP is part of a compound packet event
(Section 35.4.1). Find the ending TIP and apply the new CR3/NR values to the TIP payload IP.
2) Otherwise, look for the next MOV CR3, far branch, or VMRESUME/VMLAUNCH in the disassembly, and apply the
new CR3 to the next (or target) IP.
For examples of the packets generated by these flows, see Section 35.7.
NOTES:
1. Earlier versions of this manual used the term “IA-32e paging” to identify 4-level paging.
The MODE Leaf ID indicates which set of mode bits are held in the lower bits.
MODE.Exec Packet
Dependencies: PacketEn
Generation Scenario: Far branch, interrupt, exception, VM exit, and VM entry, if the mode changes; PSB+, and any scenario that can generate a TIP.PGE, such that the mode may have changed since the last MODE.Exec.
Description Indicates whether software is in 16, 32, or 64-bit mode, by providing the CS.D and (CS.L & IA32_EFER.LMA) values.
Essential for the decoder to properly disassemble the associated binary.
MODE.Exec is sent at the time of a mode change, if PacketEn=1 at the time, or when tracing resumes, if necessary.
In the former case, the MODE.Exec packet is generated along with other packets that result from the far transfer
operation that changes the mode. In cases where the mode changes while PacketEn=0, the processor will send out
a MODE.Exec along with the TIP.PGE when tracing resumes. The processor may opt to suppress the MODE.Exec
when tracing resumes if the mode matches that from the last MODE.Exec packet, if there was no PSB in between.
Application MODE.Exec always immediately precedes a TIP or TIP.PGE. The mode change applies to the IP address in the payload
of the next TIP or TIP.PGE.
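The two MODE.Exec payload bits map to an execution mode as follows. A minimal sketch, with an illustrative function name:

```python
def exec_mode(cs_d: int, cs_l_and_lma: int) -> int:
    """Map MODE.Exec payload bits to the address width the decoder should use."""
    if cs_l_and_lma:
        # CS.L & IA32_EFER.LMA set: 64-bit mode (CS.D is 0 in this case).
        return 64
    # Outside 64-bit mode, CS.D selects 32-bit vs. 16-bit.
    return 32 if cs_d else 16
```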
MODE.TSX Packet
Dependencies: TriggerEn and ContextEn
Generation Scenario: XBEGIN, XEND, XABORT, XACQUIRE, XRELEASE, if InTX changes; asynchronous TSX abort; PSB+
Description Indicates when a TSX transaction (either HLE or RTM) begins, commits, or aborts. Instructions executed transaction-
ally will be “rolled back” if the transaction is aborted.
Application If PacketEn=1, MODE.TSX always immediately precedes a FUP. If the TXAbort bit is zero, then the mode change
applies to the IP address in the payload of the FUP. If TXAbort=1, then the FUP will be followed by a TIP, and the
mode change will apply to the IP address in the payload of the TIP.
MODE.TSX packets may be generated when PacketEn=0, due to FilterEn=0. In this case, only the last MODE.TSX
generated before TIP.PGE need be applied.
Dependencies: TriggerEn && ContextEn
Generation Scenario: Taken branch with target in a TraceStop IP region, MOV CR3 in a TraceStop IP region, or WRMSR that sets TraceEn in a TraceStop IP region.
Description Indicates when software has entered a user-configured TraceStop region.
When the IP matches a TraceStop range while ContextEn and TriggerEn are set, a TraceStop action occurs. This dis-
ables tracing by setting IA32_RTIT_STATUS.Stopped, thereby clearing TriggerEn, and causes a TraceStop
packet to be generated.
The TraceStop action also forces FilterEn to 0. Note that TraceStop may not force a flush of internally buffered
packets, and thus trace packet generation should still be manually disabled by clearing IA32_RTIT_CTL.TraceEn
before examining output. See Section 35.2.4.3 for more details.
Application If TraceStop follows a TIP.PGD (before the next TIP.PGE), then it was triggered either by the instruction that cleared
PacketEn, or it was triggered by some later instruction that executed while FilterEn=0. In either case, the TraceStop
can be applied at the IP of the TIP.PGD (if any).
If TraceStop follows a TIP.PGE (before the next TIP.PGD), it should be applied at the last known IP.
Dependencies: TriggerEn
Generation Scenario: After any frequency change, on C-state wake up, PSB+, and after enabling trace packet generation.
Description Indicates the core:bus ratio of the processor core. Useful for correlating wall-clock time and cycle time.
Application The CBR packet indicates the point in the trace when a frequency transition has occurred. On some implementations, software execution continues during transitions to a new frequency, while on others software execution ceases during frequency transitions. There is no precise IP to which to bind the CBR packet.
Dependencies: IA32_RTIT_CTL.TSCEn && TriggerEn
Generation Scenario: Sent after any event that causes the processor clocks or Intel PT timing packets (such as MTC or CYC) to stop. This may include P-state changes, wake from C-state, or clock modulation. Also on transition of TraceEn from 0 to 1.
Description When enabled by software, a TSC packet provides the lower 7 bytes of the current TSC value, as returned by the
RDTSC instruction. This may be useful for tracking wall-clock time, and synchronizing the packets in the log with
other timestamped logs.
Application The TSC packet provides a wall-clock proxy for the event that generated it (packet generation enable, sleep state
wake, etc.). In all cases, TSC does not precisely indicate the time of any control flow packets; however, all preceding packets
represent instructions that executed before the indicated TSC time, and all subsequent packets represent instruc-
tions that executed after it. There is no precise IP to which to bind the TSC packet.
Dependencies: IA32_RTIT_CTL.MTCEn && TriggerEn
Generation Scenario: Periodic, based on the core crystal clock, or Always Running Timer (ART).
Description When enabled by software, an MTC packet provides a periodic indication of wall-clock time. The 8-bit CTC (Common
Timestamp Copy) payload value is set to (ART >> N) & FFH. The frequency of the ART is related to the Maximum
Non-Turbo frequency, and the ratio can be determined from CPUID leaf 15H, as described in Section 35.8.3.
Software can select the threshold N, which determines the MTC frequency by setting the IA32_RTIT_CTL.MTCFreq
field (see Section 35.2.7.2) to a supported value using the lookup enumerated by CPUID (see Section 35.3.1).
See Section 35.8.3 for details on how to use the MTC payload to track TSC time.
MTC provides 8 bits from the ART, starting with the bit selected by MTCFreq to dictate the frequency of the packet.
Whenever that 8-bit range being watched changes, an MTC packet will be sent out with the new value of that 8-bit
range. This allows the decoder to keep track of how much wall-clock time has elapsed since the last TSC packet was
sent, by keeping track of how many MTC packets were sent and what their value was. The decoder can infer the
truncated bits, CTC[N-1:0], are 0 at the time of the MTC packet.
There are cases in which MTC packets can be dropped, due to overflow or other micro-architectural conditions. The
decoder should be able to recover from such cases by checking the 8-bit payload of the next MTC packet, to deter-
mine how many MTC packets were dropped. It is not expected that more than 256 consecutive MTC packets will ever be
dropped.
Application MTC does not precisely indicate the time of any other packet, nor does it bind to any IP. However, all preceding pack-
ets represent instructions or events that executed before the indicated ART time, and all subsequent packets repre-
sent instructions that executed after, or at the same time as, the ART time.
Dependencies: IA32_RTIT_CTL.MTCEn && IA32_RTIT_CTL.TSCEn && TriggerEn
Generation Scenario: Sent with any TSC packet.
Description The TMA packet provides the information needed to allow the decoder to correlate MTC packets with TSC
packets. With this packet, when an MTC packet is encountered, the decoder can determine how many timestamp
counter ticks have passed since the last TSC or MTC packet. See Section 35.8.3.2 for details on how to make this cal-
culation.
Application TMA is always sent immediately following a TSC packet, and the payload values are consistent with the TSC payload
value. Thus the application of TMA matches that of TSC.
Dependencies: IA32_RTIT_CTL.CYCEn && TriggerEn
Generation Scenario: Can be sent at any time, though a maximum of one CYC packet is sent per core clock cycle. See Section 35.3.6 for CYC-eligible packets.
Description The Cycle Counter field increments at the same rate as the processor core clock ticks, but with a variable length for-
mat (using a trailing EXP bit field) and a range-capped byte length.
If the CYC value is less than 32, a 1-byte CYC will be generated, with Exp=0. If the CYC value is between 32 and
4095 inclusive, a 2-byte CYC will be generated, with byte 0 Exp=1 and byte 1 Exp=0. And so on.
CYC provides the number of core clocks that have passed since the last CYC packet. CYC can be configured to be
sent in every cycle in which an eligible packet is generated, or software can opt to use a threshold to limit the num-
ber of CYC packets, at the expense of some precision. These settings are configured using the
IA32_RTIT_CTL.CycThresh field (see Section 35.2.7.2). For details on Cycle-Accurate Mode, IPC calculation, etc, see
Section 35.3.6.
When CycThresh=0, and hence no threshold is in use, then a CYC packet will be generated in any cycle in which any
CYC-eligible packet is generated. The CYC packet will precede the other packets generated in the cycle, and provides
the precise cycle time of the packets that follow.
In addition to these CYC packets generated with other packets, CYC packets can be sent stand-alone. These packets
serve simply to update the decoder with the number of cycles passed, and are used to ensure that a wrap of the
processor’s internal cycle counter doesn’t cause cycle information to be lost. These stand-alone CYC packets do not
indicate the cycle time of any other packet or operation, and will be followed by another CYC packet before any
other CYC-eligible packet is seen.
When CycThresh>0, CYC packets are generated only after a minimum number of cycles have passed since the last
CYC packet. Once this threshold has passed, the behavior above resumes, where CYC will either be sent in the next
cycle that produces other CYC-eligible packets, or could be sent stand-alone.
When using CYC thresholds, only the cycle time of the operation (instruction or event) that generates the CYC
packet is truly known. Other operations simply have their execution time bounded: they completed at or after the
last CYC time, and before the next CYC time.
Application CYC provides the offset cycle time (since the last CYC packet) for the CYC-eligible packet that follows. If another CYC
is encountered before the next CYC-eligible packet, the cycle values should be accumulated and applied to the next
CYC-eligible packet.
If a CYC packet is generated by a TNT, note that the cycle time provided by the CYC packet applies to the first
branch in the TNT packet.
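The variable-length CYC payload described above can be decoded by accumulating 5 value bits from the first byte and 7 from each continuation byte, stopping when Exp=0 (this matches the stated ranges: one byte for values below 32, two bytes up to 4095). The exact bit positions assumed below (11B header in bits 1:0 of byte 0, Exp in bit 2 of byte 0 and bit 0 of continuation bytes) come from the packet diagram elsewhere in this chapter and should be treated as an assumption in this excerpt:

```python
def decode_cyc(data: bytes) -> tuple[int, int]:
    """Decode one CYC packet; return (cycle_value, bytes_consumed)."""
    assert data[0] & 0b11 == 0b11, "not a CYC header"
    value = (data[0] >> 3) & 0x1F          # byte 0 carries value[4:0]
    exp = (data[0] >> 2) & 1               # Exp=1: another payload byte follows
    shift, i = 5, 1
    while exp:
        value |= (data[i] >> 1) << shift   # continuation bytes carry 7 value bits
        exp = data[i] & 1
        shift += 7
        i += 1
    return value, i
```

For example, the value 5 encodes in a single byte with Exp=0, while 100 needs a second byte, consistent with the 32/4095 boundaries given in the description.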
Dependencies: TriggerEn && ContextEn; also in VMX operation.
Generation Scenario: Generated on successful VMPTRLD, and optionally on SMM VM exits and VM entries that return from SMM (see Section 35.4.2.26).
Description The VMCS packet provides a VMCS pointer for a decoder to determine the transition of code contexts:
• On a successful VMPTRLD (i.e., a VMPTRLD that doesn’t fault, fail, or VM exit), the VMCS packet contains the
logical processor’s VMCS pointer established by VMPTRLD (for subsequent execution of a VM guest context).
• An SMM VM exit loads the logical processor’s VMCS pointer with the SMM-transfer VMCS pointer. If the “conceal
VMX from PT” VM-exit control is 0 (see Section 35.5.1), a VMCS packet provides this pointer. See Section 35.6 on
tracing inside and outside STM.
• A VM entry that returns from SMM loads the logical processor’s VMCS pointer from a field in the SMM-transfer
VMCS. If the “conceal VMX from PT” VM-entry control is 0, a VMCS packet provides this pointer. Whether the
VM entry is to VMX root operation or VMX non-root operation is indicated by the PIP.NR bit.
A VMCS packet generated before a VMCS pointer has been loaded, or after the VMCS pointer has been cleared, will
set all 64 bits in the VMCS pointer field.
VMCS packets will not be seen on processors with IA32_VMX_MISC[bit 14]=0, as these processors do not allow
TraceEn to be set in VMX operation.
Application The purpose of the VMCS packet is to help the decoder uniquely identify changes in the executing software context
in situations in which CR3 may not be unique.
When a VMCS packet is encountered, a decoder should do the following:
• If there was a prior unbound FUP (that is, a FUP not preceded by a packet, such as MODE.TSX, that consumes it,
and hence one that pairs with a TIP that has not yet been seen), then this VMCS is part of a compound packet event
(Section 35.4.1). Find the ending TIP and apply the new VMCS base pointer value to the TIP payload IP.
• Otherwise, look for the next VMPTRLD, VMRESUME, or VMLAUNCH in the disassembly, and apply the new VMCS
base pointer on the next VM entry.
For examples of the packets generated by these flows, see Section 35.7.
Dependencies: TriggerEn
Generation Scenario: Always follows a PSB packet, separated from it by PSB+ packets.
Description PSBEND is simply a terminator for the series of “status only” (PSB+) packets that follow PSB (Section 35.3.7).
Application When a PSBEND packet is seen, the decoder should cease to treat packets as “status only”.
The PayloadBytes field indicates the number of bytes of payload that follow the header bytes. Encodings are as fol-
lows:
IP bit indicates if a FUP, whose payload will be the IP of the PTWRITE instruction, will follow.
Dependencies: TriggerEn & ContextEn & FilterEn & PTWEn
Generation Scenario: PTWRITE instruction
Description Contains the value held in the PTWRITE operand.
This packet is CYC-eligible, and hence will generate a CYC packet if IA32_RTIT_CTL.CYCEn=1 and any CYC Threshold
has been reached.
Application Binds to the associated PTWRITE instruction. The IP of the PTWRITE will be provided by a following FUP, when
PTW.IP=1.
Dependencies: TriggerEn & PwrEvtEn
Generation Scenario: C-state entry, P-state change, or other processor clock powerdown. Includes:
• Entry to a C-state deeper than C0.0
• TM1/2
• STPCLK#
• Frequency change due to IA32_CLOCK_MODULATION, Turbo
Description This packet indicates that software execution has stopped due to processor clock powerdown. Later packets will
indicate when execution resumes.
If EXSTOP is generated while ContextEn is set, the IP bit will be set, and EXSTOP will be followed by a FUP packet
containing the IP at which execution stopped. More precisely, this will be the IP of the oldest instruction that has
not yet completed.
This packet is CYC-eligible, and hence will generate a CYC packet if IA32_RTIT_CTL.CYCEn=1 and any CYC Threshold
has been reached.
Application If a FUP follows EXSTOP (hence IP bit set), the EXSTOP can be bound to the FUP IP. Otherwise the IP is not known.
Time of powerdown can be inferred from the preceding CYC, if CYCEn=1. Combined with the TSC at the time of
wake (if TSCEn=1), this can be used to determine the duration of the powerdown.
Dependencies: TriggerEn & PwrEvtEn & ContextEn
Generation Scenario: MWAIT, UMWAIT, or TPAUSE instructions, or I/O redirection to MWAIT, that complete without fault or VM exit.
Description Indicates that an MWAIT operation to C-state deeper than C0.0 completed. The MWAIT hints and extensions passed
in by software are exposed in the payload. For UMWAIT and TPAUSE, the EXT field holds the input register value
that determines the optimized state requested.
For entry to some highly optimized C0 sub-C-states, such as C0.1, no MWAIT packet is generated.
This packet is CYC-eligible, and hence will generate a CYC packet if IA32_RTIT_CTL.CYCEn=1 and any CYC Threshold
has been reached.
Application The MWAIT packet should bind to the IP of the next FUP, which will be the IP of the instruction that caused the
MWAIT. This FUP will be shared with EXSTOP.
Dependencies: TriggerEn & PwrEvtEn
Generation Scenario: Transition to a C-state deeper than C0.0.
Description Indicates processor entry to the resolved thread C-state and sub C-state indicated. The processor will remain in this
C-state until either another PWRE indicates the processor has moved to a C-state deeper than C0.0, or a PWRX
packet indicates a return to C0.0.
For entry to some highly optimized C0 sub-C-states, such as C0.1, no PWRE packet is generated.
Note that some CPUs may allow MWAIT to request a deeper C-state than is supported by the core. These deeper C-
states may have platform-level implications that differentiate them. However, the PWRE packet will provide only
the resolved thread C-state, which will not exceed that supported by the core.
If the C-state entry was initiated by hardware, rather than a direct software request (such as MWAIT, UMWAIT,
TPAUSE, HLT, or shutdown), the HW bit will be set to indicate this. Hardware Duty Cycling (see Section 14.5, “Hard-
ware Duty Cycling (HDC)” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B) is an
example of such a case.
Application When transitioning from C0.0 to a deeper C-state, the PWRE packet will be followed by an EXSTOP. If that EXSTOP
packet has the IP bit set, then the following FUP will provide the IP at which the C-state entry occurred. Subsequent
PWRE packets generated before the next PWRX should bind to the same IP.
Dependencies: TriggerEn & PwrEvtEn
Generation Scenario: Transition from a C-state deeper than C0.0 to C0.
Description Indicates processor return to thread C0 from a C-state deeper than C0.0.
For return from some highly optimized C0 sub-C-states, such as C0.1, no PWRX packet is generated.
The Last Core C-State field provides the MWAIT encoding for the core C-state at the time of the wake. The Deepest
Core C-State provides the MWAIT encoding for the deepest core C-state achieved during the sleep session, or since
leaving thread C0. MWAIT encodings for C-states can be found in Table 4-11 in the Intel® 64 and IA-32 Architec-
tures Software Developer’s Manual, Volume 2B. Note that these values reflect only the core C-state, and hence will
not exceed the maximum supported core C-state, even if deeper C-states can be requested.
The Wake Reason field is one-hot, encoded as follows:
Application PWRX will always apply to the same IP as the PWRE. The time of wake can be discerned from (optional) timing pack-
ets that precede PWRX.
Application A BBP will always be followed by a Block End Packet (BEP), and when the block is generated while ContextEn=1
that BEP will have IP=1 and be followed by a FUP that provides the IP to which the block should be bound. Note
that, in addition to BEP, a block can be terminated by a BBP (indicating the start of a new block) or an OVF packet.
Packet format, 8-byte payload:
Byte 0: ID[5:0], 100B
Byte 1: Payload[7:0]
Byte 2: Payload[15:8]
Byte 3: Payload[23:16]
Byte 4: Payload[31:24]
Byte 5: Payload[39:32]
Byte 6: Payload[47:40]
Byte 7: Payload[55:48]
Byte 8: Payload[63:56]

Packet format, 4-byte payload:
Byte 0: ID[5:0], 100B
Byte 1: Payload[7:0]
Byte 2: Payload[15:8]
Byte 3: Payload[23:16]
Byte 4: Payload[31:24]
The 0-settings of these VMX controls enable all VMX-specific packet information. The scenarios that would use
these default settings also do not require the VMM to use VMX MSR-load areas to enable and disable trace-packet
generation across VMX transitions.
If IA32_VMX_MISC[bit 14] reports 0, the 1-settings of the VMX controls in Table 35-51 are not supported, and
VM entry will fail on any attempt to set them.
Since the VMX controls that suppress packet generation are cleared, a VMCS packet will be included in all PSB+ for
this usage scenario. Additionally, VMPTRLD will generate such a packet. Thus the decoder can distinguish the
execution context of different VMs.
When the host VMM configures a system to collect trace packets in this scenario, it should emulate CPUID to report
CPUID.(EAX=07H, ECX=0):EBX[bit 25] as 0 to guests, indicating to guests that Intel PT is not available.
packets generated in VMX non-root operation and the absence of TSC adjustments in TSC packets generated in
VMX root operation. The VMM can supply this information to the decoder.
When tracing only the host, the decoder does not need information about the guests, and the VMX controls for
suppressing VMX-specific packets can be set to reduce the packets generated. VMCS packets will still be generated
on execution of VMPTRLD and in PSB+ generated in the host, but these will be unused by the decoder.
The packets of interest to a decoder when trace packets are collected for host-only tracing are shown in Table 35-53.
with software logs based on RDTSC, the VMM should either make corresponding modifications to the TSC packet
values in the guest trace, or use mechanisms such as TSC offsetting or TSC scaling in place of exiting.
In Table 35-56, PktEn is evaluated based on (TriggerEn & ContextEn & FilterEn & BranchEn & PTWEn).
Table 35-56. PwrEvtEn and PTWEn Packet Generation under Different Enable Conditions
Case | Operation | PktEn Before | PktEn After | CntxEn After | Other Dependencies | Packets Output
16.24a | PTWRITE rm32/64, no fault | D.C. | D.C. | D.C. | | None
16.24b | PTWRITE rm32/64, no fault | D.C. | 0 | 0 | | None
16.24d | PTWRITE rm32, no fault | D.C. | 1 | 1 | * FUP, IP=1 if FUPonPTW=1 | PTW(IP=1?, 4B, rm32_value), FUP(CLIP)?
16.24e | PTWRITE rm64, no fault | D.C. | 1 | 1 | * FUP, IP=1 if FUPonPTW=1 | PTW(IP=1?, 8B, rm64_value), FUP(CLIP)?
16.25a | PTWRITE mem32/64, fault | D.C. | D.C. | D.C. | | See “Exception/fault” (cases 13[a-z] in Table 35-55) for BranchEn packets.
that of the timestamp counter. The periodic MTC packet is generated based on software-selected multiples of
the crystal clock frequency. The MTC packet is expected to occur more frequently than the TSC packet.
• Processor core clock
The processor core clock frequency can vary due to P-state and thermal conditions. The CYC packet provides
elapsed time as measured in processor core clock cycles relative to the last CYC packet.
A decoder can use all or some combination of these packets to track time at different resolutions throughout the
trace packets.
Note that stand-alone TSC packets (that is, TSC packets that are not a part of a PSB+) are typically generated only
when generation of other timing packets (MTCs and CYCs) has ceased for a period of time. Example scenarios
include when Intel PT is re-enabled, or on wake after a sleep state. Thus any calculation of ART or cycle time
leading up to a TSC packet will likely result in a discrepancy, which the TSC packet serves to correct.
A greater level of precision may be achieved by calculating the CPU clock frequency; see Section 35.8.3.4 below for
a method to do so using Intel PT packets.
CYCs can be used to estimate time between TSCs even without MTCs, though this will likely result in a reduction in
estimated TSC precision.
26.Updates to Chapter 36, Volume 3D
Change bars show changes to Chapter 36 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3D: System Programming Guide, Part 4.
------------------------------------------------------------------------------------------
Changes to this chapter: Typo corrections. Added Section 36.8 “Intel® SGX Interactions with Control-flow
Enforcement Technology”.
CHAPTER 36
INTRODUCTION TO INTEL® SOFTWARE GUARD EXTENSIONS
36.1 OVERVIEW
Intel® Software Guard Extensions (Intel SGX) is a set of instructions and mechanisms for memory accesses
added to Intel Architecture processors. Intel SGX can encompass two collections of instruction extensions,
referred to as SGX1 and SGX2; see Table 36-1 and Table 36-2. The SGX1 extensions allow an application to instan-
tiate a protected container, referred to as an enclave. The enclave is a trusted area of memory, where critical
tiate a protected container, referred to as an enclave. The enclave is a trusted area of memory, where critical
aspects of the application functionality have hardware-enhanced confidentiality and integrity protections. New
access controls to restrict access to software not resident in the enclave are also introduced. The SGX2 extensions
allow additional flexibility in runtime management of enclave resources and thread execution within an enclave.
Chapter 37 covers main concepts, objects and data structure formats that interact within the Intel SGX architec-
ture. Chapter 38 covers operational aspects ranging from preparing an enclave, transferring control to enclave
code, and programming considerations for the enclave code and system software providing support for enclave
execution. Chapter 39 describes the behavior of Asynchronous Enclave Exit (AEX) caused by events while
executing enclave code. Chapter 40 covers the syntax and operational details of the instruction and associated leaf
functions available in Intel SGX. Chapter 41 describes interaction of various aspects of IA32 and Intel® 64 archi-
tectures with Intel SGX. Chapter 42 covers Intel SGX support for application debug, profiling and performance
monitoring.
[Figure: An enclave within an application's address space — OS, Entry Table, App Code, and an Enclave containing Enclave Heap and Enclave Code.]
Vol. 3D 36-1
INTRODUCTION TO INTEL® SOFTWARE GUARD EXTENSIONS
Intel SGX introduces two significant capabilities to the Intel Architecture. First is the change in enclave memory
access semantics. The second is protection of the address mappings of the application.
is accessible to the program currently executing. Generally, an EPC page is accessed only by the owner of the
executing enclave or by an instruction that is setting up an EPC page.
The EPC is divided into EPC pages. An EPC page is 4KB in size and always aligned on a 4KB boundary.
Pages in the EPC can either be valid or invalid. Every valid page in the EPC belongs to one enclave instance. Each
enclave instance has an EPC page that holds its SECS. The security metadata for each EPC page is held in an
internal micro-architectural structure called Enclave Page Cache Map (EPCM, see Section 36.5.1).
The EPC is managed by privileged software. Intel SGX provides a set of instructions for adding and removing
content to and from the EPC. The EPC may be configured by BIOS at boot time. On implementations in which EPC
memory is part of system DRAM, the contents of the EPC are protected by an encryption engine.
Table 36-1. Supervisor and User Mode Enclave Instruction Leaf Functions in Long-Form of SGX1
Supervisor Instruction Description User Instruction Description
ENCLS[EADD] Add an EPC page to an enclave. ENCLU[EENTER] Enter an enclave.
ENCLS[EBLOCK] Block an EPC page. ENCLU[EEXIT] Exit an enclave.
ENCLS[ECREATE] Create an enclave. ENCLU[EGETKEY] Create a cryptographic key.
ENCLS[EDBGRD] Read data from a debug enclave by debugger. ENCLU[EREPORT] Create a cryptographic report.
ENCLS[EDBGWR] Write data into a debug enclave by debugger. ENCLU[ERESUME] Re-enter an enclave.
ENCLS[EEXTEND] Extend EPC page measurement.
ENCLS[EINIT] Initialize an enclave.
ENCLS[ELDB] Load an EPC page in blocked state.
ENCLS[ELDU] Load an EPC page in unblocked state.
ENCLS[EPA] Add an EPC page to create a version array.
ENCLS[EREMOVE] Remove an EPC page from an enclave.
ENCLS[ETRACK] Activate EBLOCK checks.
ENCLS[EWB] Write back/invalidate an EPC page.
Table 36-2. Supervisor and User Mode Enclave Instruction Leaf Functions in Long-Form of SGX2
Supervisor Instruction Description User Instruction Description
ENCLS[EAUG] Allocate EPC page to an existing enclave. ENCLU[EACCEPT] Accept EPC page into the enclave.
ENCLS[EMODPR] Restrict page permissions. ENCLU[EMODPE] Enhance page permissions.
ENCLS[EMODT] Modify EPC page type. ENCLU[EACCEPTCOPY] Copy contents to an augmented EPC page and accept the EPC page into the enclave.
Table 36-3. VMX Operation and Supervisor Mode Enclave Instruction Leaf Functions in Long-Form of OVERSUB
VMX Operation Description Supervisor Instruction Description
ENCLV[EDECVIRTCHILD]  Decrement the virtual child page count.   ENCLS[ERDINFO]      Read information about an EPC page.
ENCLV[EINCVIRTCHILD]  Increment the virtual child page count.   ENCLS[ETRACKC]      Activate EBLOCK checks with conflict reporting.
ENCLV[ESETCONTEXT]    Set virtualization context.               ENCLS[ELDBC/ELDUC]  Load an EPC page with conflict reporting.
Table 36-5. CPUID Leaf 12H, Sub-Leaf 0 Enumeration of Intel® SGX Capabilities
CPUID.(EAX=12H,ECX=0) Register    Bits    Description Behavior
EAX     0       SGX1: If 1, indicates that the SGX1 leaf functions listed in Table 36-1 are supported.
        1       SGX2: If 1, indicates that the SGX2 leaf functions listed in Table 36-2 are supported.
4:2 Reserved (0)
5 OVERSUB: If 1, indicates Intel SGX supports instructions: EINCVIRTCHILD, EDECVIRTCHILD, and
ESETCONTEXT.
6 OVERSUB: If 1, indicates Intel SGX supports instructions: ETRACKC, ERDINFO, ELDBC, and ELDUC.
31:7 Reserved (0)
EBX     31:0    MISCSELECT: Reports the bit vector of supported extended features that can be written to the MISC region of the SSA.
ECX 31:0 Reserved (0).
Table 36-5. CPUID Leaf 12H, Sub-Leaf 0 Enumeration of Intel® SGX Capabilities (Contd.)
CPUID.(EAX=12H,ECX=0) Register    Bits    Description Behavior
EDX     7:0     MaxEnclaveSize_Not64: the maximum supported enclave size is 2^(EDX[7:0]) bytes when not in 64-bit mode.
        15:8    MaxEnclaveSize_64: the maximum supported enclave size is 2^(EDX[15:8]) bytes when operating in 64-bit mode.
        31:16   Reserved (0).
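The enclave-size encoding above can be decoded with plain bit arithmetic. A minimal sketch (the helper names and the sample EDX value used below are illustrative, not from the manual):

```c
#include <assert.h>
#include <stdint.h>

/* Decode CPUID.(EAX=12H,ECX=0):EDX into maximum enclave sizes in bytes:
   2^(EDX[7:0]) outside 64-bit mode, 2^(EDX[15:8]) in 64-bit mode. */
uint64_t max_enclave_size_not64(uint32_t edx)
{
    return 1ULL << (edx & 0xFFu);
}

uint64_t max_enclave_size_64(uint32_t edx)
{
    return 1ULL << ((edx >> 8) & 0xFFu);
}
```

For a hypothetical EDX of 0x241F (EDX[7:0] = 31, EDX[15:8] = 36), the limits come out to 2^31 and 2^36 bytes.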
Table 36-6. CPUID Leaf 12H, Sub-Leaf 1 Enumeration of Intel® SGX Capabilities
CPUID.(EAX=12H,ECX=1) Register    Bits    Description Behavior
EAX 31:0 Report the valid bits of SECS.ATTRIBUTES[31:0] that software can set with ECREATE.
SECS.ATTRIBUTES[n] can be set to 1 using ECREATE only if EAX[n] is 1, where n < 32.
EBX 31:0 Report the valid bits of SECS.ATTRIBUTES[63:32] that software can set with ECREATE.
SECS.ATTRIBUTES[n+32] can be set to 1 using ECREATE only if EBX[n] is 1, where n < 32.
ECX 31:0 Report the valid bits of SECS.ATTRIBUTES[95:64] that software can set with ECREATE.
SECS.ATTRIBUTES[n+64] can be set to 1 using ECREATE only if ECX[n] is 1, where n < 32.
EDX 31:0 Report the valid bits of SECS.ATTRIBUTES[127:96] that software can set with ECREATE.
SECS.ATTRIBUTES[n+96] can be set to 1 using ECREATE only if EDX[n] is 1, where n < 32.
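The four registers together form a 128-bit validity mask over SECS.ATTRIBUTES. A sketch of the check an enclave loader might perform before ECREATE, treating EBX:EAX and EDX:ECX as two 64-bit halves (the function name and the sample values in the usage note are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* A requested ATTRIBUTES value is settable by ECREATE only if every 1 bit
   in the request is also 1 in the CPUID.(EAX=12H,ECX=1) validity mask. */
int attributes_permitted(uint64_t req_lo, uint64_t req_hi,
                         uint64_t mask_lo, uint64_t mask_hi)
{
    return ((req_lo & ~mask_lo) | (req_hi & ~mask_hi)) == 0;
}
```

For example, with hypothetical masks 0x36 (low) and 0x1F (high), a request of bit 2 alone is permitted, while a request of bit 6 is not.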
On processors that support Intel SGX1 and SGX2, CPUID leaf 12H sub-leaves 2 and higher report the physical memory
resources available for use with Intel SGX. These physical memory sections are typically allocated by BIOS as
Processor Reserved Memory and made available to the OS to manage as the EPC.
To enumerate how many EPC sections are available to the EPC manager, software can enumerate CPUID leaf 12H
with sub-leaf index starting from 2, and decode the sub-leaf-type encoding (returned in EAX[3:0]) until the sub-
leaf type is invalid. All invalid sub-leaves of CPUID leaf 12H return EAX/EBX/ECX/EDX with 0.
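Reassembling a section's physical base and size from the registers of one valid sub-leaf is a matter of bit splicing, following the field layout of Table 36-7. A sketch (the helper names and sample register values are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Sub-leaf type from EAX[3:0]; 0001b marks an EPC section. */
int epc_subleaf_valid(uint32_t eax)
{
    return (eax & 0xFu) == 1u;
}

/* Physical base: bits 31:12 from EAX[31:12], bits 51:32 from EBX[19:0]. */
uint64_t epc_base(uint32_t eax, uint32_t ebx)
{
    return ((uint64_t)(ebx & 0xFFFFFu) << 32) | (eax & 0xFFFFF000u);
}

/* Section size: bits 31:12 from ECX[31:12], bits 51:32 from EDX[19:0]. */
uint64_t epc_size(uint32_t ecx, uint32_t edx)
{
    return ((uint64_t)(edx & 0xFFFFFu) << 32) | (ecx & 0xFFFFF000u);
}
```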
Table 36-7. CPUID Leaf 12H, Sub-Leaf Index 2 or Higher Enumeration of Intel® SGX Resources
CPUID.(EAX=12H,ECX > 1) Register    Bits    Description Behavior
EAX     3:0     0000b: This sub-leaf is invalid; EDX:ECX:EBX:EAX return 0.
                0001b: This sub-leaf enumerates an EPC section. EBX:EAX and EDX:ECX provide information on the
                Enclave Page Cache (EPC) section.
                All other encodings are reserved.
        11:4    Reserved (enumerate 0).
        31:12   If EAX[3:0] = 0001b, these are bits 31:12 of the physical address of the base of the EPC section.
EBX     19:0    If EAX[3:0] = 0001b, these are bits 51:32 of the physical address of the base of the EPC section.
        31:20   Reserved.
ECX     3:0     If EAX[3:0] = 0000b, then all bits of the EDX:ECX pair are enumerated as 0.
                If EAX[3:0] = 0001b, then this section has confidentiality and integrity protection.
                All other encodings are reserved.
        11:4    Reserved (enumerate 0).
        31:12   If EAX[3:0] = 0001b, these are bits 31:12 of the size of the corresponding EPC section within the
                Processor Reserved Memory.
Table 36-7. CPUID Leaf 12H, Sub-Leaf Index 2 or Higher Enumeration of Intel® SGX Resources (Contd.)
CPUID.(EAX=12H,ECX > 1) Register    Bits    Description Behavior
EDX     19:0    If EAX[3:0] = 0001b, these are bits 51:32 of the size of the corresponding EPC section within the
                Processor Reserved Memory.
        31:20   Reserved.
27.Updates to Chapter 37, Volume 3D
Change bars show changes to Chapter 37 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3D: System Programming Guide, Part 4.
------------------------------------------------------------------------------------------
Changes to chapter: Added Control-flow Enforcement Technology details.
CHAPTER 37
ENCLAVE ACCESS CONTROL AND DATA STRUCTURES
37.2 TERMINOLOGY
A memory access to the ELRANGE that is initiated by an instruction executed by an enclave is called a Direct Enclave
Access (Direct EA).
Memory accesses initiated by certain Intel® SGX instruction leaf functions such as ECREATE, EADD, EDBGRD,
EDBGWR, ELDU/ELDB, EWB, EREMOVE, EENTER, and ERESUME to EPC pages are called Indirect Enclave Accesses
(Indirect EA). Table 37-1 lists additional details of the indirect EA of SGX1 and SGX2 extensions.
Direct EAs and Indirect EAs together are called Enclave Accesses (EAs).
Any memory access that is not an Enclave Access is called a non-enclave access.
• Direct EAs to any EPC pages must conform to the currently defined security attributes for that EPC page in the
EPCM. These attributes may be defined at enclave creation time (EADD) or when the enclave sets them using
SGX2 instructions. The failure of these checks results in a #PF with the PFEC.SGX bit set.
— Target page must belong to the currently executing enclave.
— Data may be written to an EPC page if the EPCM allows write access.
— Data may be read from an EPC page if the EPCM allows read access.
— Instruction fetches from an EPC page are allowed if the EPCM allows execute access.
— Shadow-stack-load from an EPC page and shadow-stack-store to an EPC page are allowed only if the page
type is PT_SS_FIRST or PT_SS_REST.
— Data writes that are not shadow-stack-store are not allowed if the EPCM page type is PT_SS_FIRST or
PT_SS_REST.
— Target page must not have a restricted page type1 (PT_SECS, PT_TCS, PT_VA, or PT_TRIM).
— The EPC page must not be BLOCKED.
— The EPC page must not be PENDING.
— The EPC page must not be MODIFIED.
1. EPCM may allow write, read or execute access only for pages with page type PT_REG.
processor splits the access into two sub-accesses (one inside ELRANGE and the other outside ELRANGE), and each
access is evaluated. A code-fetch access that splits across ELRANGE results in a #GP due to the portion that lies
outside of the ELRANGE.
Table 37-1. List of Implicit and Explicit Memory Access by Intel® SGX Enclave Instructions
Instr. Leaf Enum. Explicit 1 Explicit 2 Explicit 3 Implicit
EACCEPT SGX2 SECINFO EPCPAGE SECS
EACCEPTCOPY SGX2 SECINFO EPCPAGE (Src) EPCPAGE (Dst)
EADD SGX1 PAGEINFO and linked structures EPCPAGE
EAUG SGX2 PAGEINFO and linked structures EPCPAGE SECS
EBLOCK SGX1 EPCPAGE SECS
ECREATE SGX1 PAGEINFO and linked structures EPCPAGE
EDBGRD SGX1 EPCADDR Destination SECS
EDBGWR SGX1 EPCADDR Source SECS
EDECVIRTCHILD OVERSUB EPCPAGE SECS
EENTER SGX1 TCS and linked SSA SECS
Table 37-1. List of Implicit and Explicit Memory Access by Intel® SGX Enclave Instructions (Contd.)
Instr. Leaf Enum. Explicit 1 Explicit 2 Explicit 3 Implicit
EEXIT SGX1 SECS, TCS
EEXTEND SGX1 SECS EPCPAGE
EGETKEY SGX1 KEYREQUEST KEY SECS
EINCVIRTCHILD OVERSUB EPCPAGE SECS
EINIT SGX1 SIGSTRUCT SECS EINITTOKEN
ELDB/ELDU SGX1 PAGEINFO and linked structures, PCMD EPCPAGE VAPAGE
ELDBC/ELDUC OVERSUB PAGEINFO and linked structures EPCPAGE VAPAGE
EMODPE SGX2 SECINFO EPCPAGE
EMODPR SGX2 SECINFO EPCPAGE SECS
EMODT SGX2 SECINFO EPCPAGE SECS
EPA SGX1 EPCADDR
ERDINFO OVERSUB RDINFO EPCPAGE
EREMOVE SGX1 EPCPAGE SECS
EREPORT SGX1 TARGETINFO REPORTDATA OUTPUTDATA SECS
ERESUME SGX1 TCS and linked SSA SECS
ESETCONTEXT OVERSUB SECS ContextValue
ETRACK SGX1 EPCPAGE
ETRACKC OVERSUB EPCPAGE
EWB SGX1 PAGEINFO and linked structures, PCMD EPCPAGE VAPAGE SECS
Asynchronous Enclave Exit*                                                                      SECS, TCS, SSA
*Details of Asynchronous Enclave Exit (AEX) are described in Section 39.4.
Details of the top-level data structures and associated sub-structures are listed in Section 37.7 through Section
37.20.
37.7.1 ATTRIBUTES
The ATTRIBUTES data structure comprises bit-granular fields used in the SECS, REPORT, and KEYREQUEST
structures. CPUID.(EAX=12H, ECX=1) enumerates a bitmap of the permitted 1-settings of the bits in ATTRIBUTES.
37.8.1 TCS.FLAGS
37.9.1.1 EXITINFO
EXITINFO contains the information used to report exit reasons to software inside the enclave. It is a 4-byte field laid
out as in Table 37-10. The VALID bit is set only for the exception conditions that are reported inside an enclave.
See Table 37-11 for which exceptions are reported inside the enclave. If the exception condition is not one reported
inside the enclave, then VECTOR and EXIT_TYPE are cleared.
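A reader-side decode of the field might look as follows; the bit positions assumed here (VECTOR in bits 7:0, EXIT_TYPE in bits 10:8, VALID in bit 31) follow the Table 37-10 layout and should be checked against that table:

```c
#include <assert.h>
#include <stdint.h>

/* Split a 4-byte EXITINFO value into its fields. */
unsigned exitinfo_vector(uint32_t e)    { return e & 0xFFu; }        /* bits 7:0  */
unsigned exitinfo_exit_type(uint32_t e) { return (e >> 8) & 0x7u; }  /* bits 10:8 */
int      exitinfo_valid(uint32_t e)     { return (e >> 31) & 1u; }   /* bit 31    */
```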
37.12.1 SECINFO.FLAGS
The SECINFO.FLAGS are a set of fields describing the properties of an enclave page.
37.16.1 REPORTDATA
REPORTDATA is a 64-Byte data structure that is provided by the enclave and included in the REPORT. It can be used
to securely pass information from the enclave to the target enclave.
28.Updates to Chapter 39, Volume 3D
Change bars show changes to Chapter 39 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3D: System Programming Guide, Part 4.
------------------------------------------------------------------------------------------
Changes to chapter: Added Control-flow Enforcement Technology details.
CHAPTER 39
ENCLAVE EXITING EVENTS
Certain events, such as exceptions and interrupts, incident to (but asynchronous with) enclave execution may
cause control to transition outside of enclave mode. (Most of these also cause a change of privilege level.) To
protect the integrity and security of the enclave, the processor will exit the enclave (and enclave mode) before
invoking the handler for such an event. For that reason, such events are called enclave-exiting events (EEE);
EEEs include external interrupts, non-maskable interrupts, system-management interrupts, exceptions, and VM
exits.
The process of leaving an enclave in response to an EEE is called an asynchronous enclave exit (AEX). To protect
the secrecy of the enclave, an AEX saves the state of certain registers within enclave memory and then loads those
registers with fixed values called synthetic state.
[Figure 39-1 depicts the exit stack after an AEX with a stack switch: SS, RSP, RFLAGS, CS, RIP, and an optional
error code are pushed, leaving RSP pointing below the pushes; RAX holds the ENCLU[ERESUME] leaf index, RBX the
TCS linear address, and RCX the AEP.]
Figure 39-1. Exit Stack Just After Interrupt with Stack Switch
In all cases, the choice of exit stack and the information pushed onto it is consistent with non-SGX operation.
Figure 39-1 shows the Application and Exiting Stacks after an exit with a stack switch. An exit without a stack
switch uses the Application Stack. The ERESUME leaf index value is placed into RAX, the TCS pointer is placed in
RBX and the AEP (see below) is placed into RCX to facilitate resuming the enclave after the exit.
Upon an AEX, the AEP (Asynchronous Exit Pointer) is loaded into the RIP. The AEP points to a trampoline code
sequence which includes the ERESUME instruction that is later used to reenter the enclave.
The following bits of RFLAGS are cleared before RFLAGS is pushed onto the exit stack: CF, PF, AF, ZF, SF, OF, RF. The
remaining bits are left unchanged.
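In flag-register terms, this scrub amounts to clearing a fixed mask. A sketch (the bit positions are the architectural RFLAGS positions: CF=0, PF=2, AF=4, ZF=6, SF=7, OF=11, RF=16; the names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Bits cleared in RFLAGS before it is pushed onto the exit stack. */
#define AEX_RFLAGS_CLEAR_MASK \
    ((1ULL << 0) | (1ULL << 2) | (1ULL << 4) | (1ULL << 6) | \
     (1ULL << 7) | (1ULL << 11) | (1ULL << 16))

uint64_t aex_scrub_rflags(uint64_t rflags)
{
    return rflags & ~AEX_RFLAGS_CLEAR_MASK;  /* remaining bits unchanged */
}
```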
[Figure: the State Save Area stack — NSSA frames, each SECS.SSAFRAMESIZE pages in size and composed of
XSAVE, MISC, and GPR regions, located at offset OSSA from the enclave base, with CSSA indexing the current SSA
frame.]
The location of the SSA frames to be used is controlled by the following variables in the TCS and the SECS:
• Size of a frame in the State Save Area (SECS.SSAFRAMESIZE): This defines the number of 4-KByte pages in a
single frame in the State Save Area. The SSA frame size must be large enough to hold the GPR state, the XSAVE
state, and the MISC state.
• Base address of the enclave (SECS.BASEADDR): This defines the enclave's base linear address from which the
offset to the base of the SSA stack is calculated.
• Number of State Save Area Slots (TCS.NSSA): This defines the total number of slots (frames) in the State Save
Area stack.
• Current State Save Area Slot (TCS.CSSA): This defines the slot to use on the next exit.
• State Save Area Offset (TCS.OSSA): This defines the offset of the base address of a set of State Save Area slots
from the enclave’s base address.
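These variables combine into the linear address of the frame used on the next exit: the enclave base plus OSSA locates the bottom of the SSA stack, and CSSA steps through frames of SSAFRAMESIZE pages. A sketch of that relationship (the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Linear address of the SSA frame selected by TCS.CSSA;
   ssaframesize is SECS.SSAFRAMESIZE, in 4-KByte pages. */
uint64_t current_ssa_frame(uint64_t baseaddr, uint64_t ossa,
                           uint32_t cssa, uint32_t ssaframesize)
{
    return baseaddr + ossa + (uint64_t)cssa * ssaframesize * 4096u;
}
```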
When an AEX occurs, hardware selects the SSA frame to use by examining TCS.CSSA. Processor state is saved into
the SSA frame (see Section 39.4) and loaded with a synthetic state (as described in Section 39.3.1) to avoid
leaking secrets, RSP and RBP are restored to their values prior to enclave entry, and TCS.CSSA is incremented. As
will be described later, if an exception takes the last slot, it will not be possible to reenter the enclave to handle the
exception from within the enclave. A subsequent ERESUME restores the processor state from the current SSA
frame and frees the SSA frame.
The format of the XSAVE section of SSA is identical to the format used by the XSAVE/XRSTOR instructions. On
EENTER, CSSA must be less than NSSA, ensuring that there is at least one State Save Area slot available for exits.
If there is no free SSA frame when executing EENTER, the entry will fail.
The pseudo code in this section describes the internal operations that are executed when an AEX occurs in enclave
mode. These operations occur just before the normal interrupt or exception processing occurs.
(* Save all registers. When saving EFLAGS, the TF bit is set to 0 and
the RF bit is set to what would have been saved on the stack in the non-SGX case *)
IF (TMP_MODE64 = 0)
THEN
Save EAX, EBX, ECX, EDX, ESP, EBP, ESI, EDI, EFLAGS, EIP into the current SSA frame using
CR_GPR_PA; (* see Table 40-5 for list of CREGs used to describe internal operation within Intel SGX *)
SSA.RFLAGS.TF ← 0;
ELSE (* TMP_MODE64 = 1 *)
Save RAX, RBX, RCX, RDX, RSP, RBP, RSI, RDI, R8-R15, RFLAGS, RIP into the current SSA frame using
CR_GPR_PA;
SSA.RFLAGS.TF ← 0;
FI;
Save FS and GS BASE into SSA using CR_GPR_PA;
(* store XSAVE state into the current SSA frame's XSAVE area using the physical addresses
that were determined and cached at enclave entry time with CR_XSAVE_PAGE_i. *)
For each XSAVE state i defined by (SECS.ATTRIBUTES.XFRM[i] = 1, destination address cached in
CR_XSAVE_PAGE_i)
SSA.XSAVE.i ← XSAVE_STATE_i;
CR_XSAVE_PAGE_0.XHEADER_BV[191:64] ← 0;
CR_XSAVE_PAGE_0.XHEADER_BV[63:0] ←
CR_XSAVE_PAGE_0.XHEADER_BV[63:0] & SECS(CR_ACTIVE_SECS).ATTRIBUTES.XFRM;
Apply synthetic state to GPRs, RFLAGS, extended features, etc.
(* Restore the RSP and RBP from the current SSA frame's GPR area using the physical address
that was determined and cached at enclave entry time with CR_GPR_PA. *)
RSP ← CR_GPR_PA.URSP;
RBP ← CR_GPR_PA.URBP;
CR_GPR_PA.EXITINFO.VECTOR ← 0;
CR_GPR_PA.EXITINFO.EXIT_TYPE ← 0;
CR_GPR_PA.REASON.VALID ← 0;
FI;
IF (CPUID.(EAX=12H, ECX=1):EAX[6]= 1)
THEN
IF (CR4.CET == 1 AND IA32_U_CET.SH_STK_EN == 1)
THEN
CR_CET_SAVE_AREA_PA.SSP ← SSP;
CR_TCS_PA.PREVSSP ← SSP;
FI;
IF (CR4.CET == 1 AND IA32_U_CET.ENDBR_EN == 1)
THEN
CR_CET_SAVE_AREA_PA.TRACKER ← IA32_U_CET.TRACKER;
CR_CET_SAVE_AREA_PA.SUPPRESS ← IA32_U_CET.SUPPRESS;
FI;
IF (CPUID.(EAX=7, ECX=0):ECX[CET_SS])
SSP ← CR_SAVE_SSP; FI;
IF (CR_DBGOPTIN = 0)
THEN
Un-suppress all breakpoints that overlap ELRANGE
(* Clear suppressed breakpoint matches *)
Restore suppressed breakpoint matches
(* Restore TF *)
RFLAGS.TF ← CR_SAVE_TF;
Un-suppress monitor trap flag;
Un-suppress branch recording facilities;
Un-suppress all suppressed performance monitoring activity;
Promote any sibling-thread counters that were demoted from AnyThread to MyThread during enclave
entry back to AnyThread;
FI;
(* end_of_flow *)
(* Execution continues with normal event processing. *)
29.Updates to Chapter 40, Volume 3D
Change bars show changes to Chapter 40 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3D: System Programming Guide, Part 4.
------------------------------------------------------------------------------------------
Changes to chapter: Minor typo corrections, and added Control-flow Enforcement Technology details.
CHAPTER 40
SGX INSTRUCTION REFERENCES
This chapter describes the supervisor and user level instructions provided by Intel® Software Guard Extensions
(Intel® SGX). In general, various functionality is encoded as leaf functions within the ENCLS (supervisor), ENCLU
(user), and the ENCLV (virtualization operation) instruction mnemonics. Different leaf functions are encoded by
specifying an input value in the EAX register of the respective instruction mnemonic.
Table 40-3. Register Usage of Virtualization Operation Enclave Instruction Leaf Functions
Instr. Leaf EAX RBX RCX RDX
EDECVIRTCHILD 00H (In) EPCPAGE (In, EA) SECS (In, EA)
EINCVIRTCHILD 01H (In) EPCPAGE (In, EA) SECS (In, EA)
ESETCONTEXT 02H (In) EPCPAGE (In, EA) Context Value (In, EA)
EA: Effective Address
• Concurrent Access means that any other leaf function that requires any access to the same EPC page may be
executed concurrently. For example, EGETKEY has no concurrency requirements for the KEYREQUEST page.
In addition to the base concurrency requirements, additional concurrency requirements are listed, which apply
only to specific sets of leaf functions. For example, there are additional requirements that apply for EADD, EEXTEND,
and EINIT. EADD and EEXTEND cannot execute concurrently on the same SECS page.
The tables also detail the leaf function's behavior when a conflict happens, i.e., a concurrency requirement is not
met. In this case, the leaf function may return an SGX_EPC_PAGE_CONFLICT error code in RAX, or it may cause an
exception. In addition, the tables detail those conflicts where a VM Exit may be triggered, and list the Exit Qualifi-
cation code that is provided in such cases.
Description
The ENCLS instruction invokes the specified privileged Intel SGX leaf function for managing and debugging
enclaves. Software specifies the leaf function by setting the appropriate value in the register EAX as input. The
registers RBX, RCX, and RDX have leaf-specific purpose, and may act as input, as output, or may be unused. In 64-
bit mode, the instruction ignores upper 32 bits of the RAX register.
The ENCLS instruction produces an invalid-opcode exception (#UD) if CR0.PE = 0 or RFLAGS.VM = 1, or if it is
executed in system-management mode (SMM). Additionally, any attempt to execute the instruction when CPL > 0
results in #UD. The instruction produces a general-protection exception (#GP) if CR0.PG = 0 or if an attempt is
made to invoke an undefined leaf function.
In VMX non-root operation, execution of ENCLS may cause a VM exit if the “enable ENCLS exiting” VM-execution
control is 1. In this case, execution of individual leaf functions of ENCLS is governed by the ENCLS-exiting bitmap
field in the VMCS. Each bit in that field corresponds to the index of an ENCLS leaf function (as provided in EAX).
Software in VMX root operation can thus intercept the invocation of various ENCLS leaf functions in VMX non-root
operation by setting the “enable ENCLS exiting” VM-execution control and setting the corresponding bits in the
ENCLS-exiting bitmap.
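The resulting interception decision reduces to one bit test, with leaf indices 63 and above collapsing onto bit 63 of the bitmap. A sketch of how a VMM might evaluate it (the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* 1 if executing the ENCLS leaf in EAX would cause a VM exit under the
   given ENCLS-exiting bitmap: leaves 0..62 use their own bitmap bit,
   all higher leaves use bit 63. */
int encls_vmexit(uint64_t exiting_bitmap, uint32_t eax)
{
    unsigned bit = (eax < 63u) ? eax : 63u;
    return (exiting_bitmap >> bit) & 1u;
}
```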
Addresses and operands are 32 bits outside 64-bit mode (IA32_EFER.LMA = 0 || CS.L = 0) and are 64 bits in 64-
bit mode (IA32_EFER.LMA = 1 || CS.L = 1). CS.D value has no impact on address calculation. The DS segment is
used to create linear addresses.
Segment override prefixes and address-size override prefixes are ignored, as is the REX prefix in 64-bit mode.
Operation
IF TSX_ACTIVE
THEN GOTO TSX_ABORT_PROCESSING; FI;
IF (CPL > 0)
THEN #UD; FI;
IF in VMX non-root operation and the “enable ENCLS exiting“ VM-execution control is 1
THEN
IF (EAX < 63 and ENCLS_exiting_bitmap[EAX] = 1) or (EAX > 62 and ENCLS_exiting_bitmap[63] = 1)
THEN VM exit;
FI;
FI;
IF IA32_FEATURE_CONTROL.LOCK = 0 or IA32_FEATURE_CONTROL.SGX_ENABLE = 0
THEN #GP(0); FI;
IF CR0.PG = 0
THEN #GP(0); FI;
Flags Affected
See individual leaf functions
Description
The ENCLU instruction invokes the specified non-privileged Intel SGX leaf functions. Software specifies the leaf
function by setting the appropriate value in the register EAX as input. The registers RBX, RCX, and RDX have leaf-
specific purpose, and may act as input, as output, or may be unused. In 64-bit mode, the instruction ignores upper
32 bits of the RAX register.
The ENCLU instruction produces an invalid-opcode exception (#UD) if CR0.PE = 0 or RFLAGS.VM = 1, or if it is
executed in system-management mode (SMM). Additionally, any attempt to execute this instruction when CPL < 3
results in #UD. The instruction produces a general-protection exception (#GP) if either CR0.PG or CR0.NE is 0, or
if an attempt is made to invoke an undefined leaf function. The ENCLU instruction produces a device not available
exception (#NM) if CR0.TS = 1.
Addresses and operands are 32 bits outside 64-bit mode (IA32_EFER.LMA = 0 or CS.L = 0) and are 64 bits in 64-
bit mode (IA32_EFER.LMA = 1 and CS.L = 1). CS.D value has no impact on address calculation. The DS segment
is used to create linear addresses.
Segment override prefixes and address-size override prefixes are ignored, as is the REX prefix in 64-bit mode.
Operation
IN_64BIT_MODE ← 0;
IF TSX_ACTIVE
THEN GOTO TSX_ABORT_PROCESSING; FI;
(* If enclosing app has CET indirect branch tracking enabled then if it is not ERESUME leaf cause a #CP fault *)
(* If the ERESUME is not successful it will leave tracker in WAIT_FOR_ENDBRANCH *)
TRACKER = (CPL == 3) ? IA32_U_CET.TRACKER : IA32_S_CET.TRACKER
IF EndbranchEnabledAndNotSuppressed(CPL) and TRACKER = WAIT_FOR_ENDBRANCH and
(EAX != ERESUME or CR0.TS or (in SMM) or (CPUID.SGX_LEAF.0:EAX.SE1 = 0) or (CPL < 3))
THEN
Handle CET State machine violation (* see Section 18.3.6, “Legacy Compatibility Treatment” in the
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1. *)
FI;
IF CR0.TS = 1
THEN #NM; FI;
IF CPL < 3
THEN #UD; FI;
IF IA32_FEATURE_CONTROL.LOCK = 0 or IA32_FEATURE_CONTROL.SGX_ENABLE = 0
THEN #GP(0); FI;
IF CR0.PG = 0 or CR0.NE = 0
THEN #GP(0); FI;
Flags Affected
See individual leaf functions
Description
The ENCLV instruction invokes the virtualization SGX leaf functions for managing enclaves in a virtualized environ-
ment. Software specifies the leaf function by setting the appropriate value in the register EAX as input. The regis-
ters RBX, RCX, and RDX have leaf-specific purpose, and may act as input, as output, or may be unused. In non-64-bit
mode, the instruction ignores the upper 32 bits of the RAX register.
The ENCLV instruction produces an invalid-opcode exception (#UD) if CR0.PE = 0 or RFLAGS.VM = 1, if it is
executed in system-management mode (SMM), or not in VMX operation. Additionally, any attempt to execute the
instruction when CPL > 0 results in #UD. The instruction produces a general-protection exception (#GP) if CR0.PG
= 0 or if an attempt is made to invoke an undefined leaf function.
Software in VMX root mode of operation can enable execution of the ENCLV instruction in VMX non-root mode by
setting enable ENCLV execution control in the VMCS. If enable ENCLV execution control in the VMCS is clear, execu-
tion of the ENCLV instruction in VMX non-root mode results in #UD.
When execution of ENCLV instruction in VMX non-root mode is enabled, software in VMX root operation can inter-
cept the invocation of various ENCLV leaf functions in VMX non-root operation by setting the corresponding bits in
the ENCLV-exiting bitmap.
Addresses and operands are 32 bits in 32-bit mode (IA32_EFER.LMA == 0 || CS.L == 0) and are 64 bits in 64-bit
mode (IA32_EFER.LMA == 1 && CS.L == 1). CS.D value has no impact on address calculation.
Segment override prefixes and address-size override prefixes are ignored, as is the REX prefix in 64-bit mode.
Operation
IF TSX_ACTIVE
THEN GOTO TSX_ABORT_PROCESSING; FI;
IF (CPL > 0)
THEN #UD; FI;
FI;
IF IA32_FEATURE_CONTROL.LOCK = 0 or IA32_FEATURE_CONTROL.SGX_ENABLE = 0
THEN #GP(0); FI;
IF CR0.PG = 0
THEN #GP(0); FI;
Flags Affected
See individual leaf functions.
Description
This leaf function copies a source page from non-enclave memory into the EPC, associates the EPC page with an
SECS page residing in the EPC, and stores the linear address and security attributes in EPCM. As part of the asso-
ciation, the enclave offset and the security attributes are measured and extended into the SECS.MRENCLAVE. This
instruction can only be executed when current privilege level is 0.
RBX contains the effective address of a PAGEINFO structure while RCX contains the effective address of an EPC
page. The table below provides additional information on the memory parameter of EADD leaf function.
Concurrency Restrictions
Operation
TMP_SRCPGE ← DS:RBX.SRCPGE;
TMP_SECS ← DS:RBX.SECS;
TMP_SECINFO ← DS:RBX.SECINFO;
TMP_LINADDR ← DS:RBX.LINADDR;
SCRATCH_SECINFO ← DS:TMP_SECINFO;
IF (EPCM(DS:RCX).VALID ≠ 0)
THEN #PF(DS:RCX); FI;
CASE (SCRATCH_SECINFO.FLAGS.PT)
PT_TCS:
IF (DS:RCX.RESERVED ≠ 0) #GP(0); FI;
IF ( (DS:TMP_SECS.ATTRIBUTES.MODE64BIT = 0) and
((DS:TCS.FSLIMIT & 0FFFH ≠ 0FFFH) or (DS:TCS.GSLIMIT & 0FFFH ≠ 0FFFH) )) #GP(0); FI;
(* Ensure TCS.PREVSSP is zero *)
IF (CPUID.(EAX=07H, ECX=00h):ECX[CET_SS] = 1) and (DS:RCX.PREVSSP != 0) #GP(0); FI;
BREAK;
PT_REG:
IF (SCRATCH_SECINFO.FLAGS.W = 1 and SCRATCH_SECINFO.FLAGS.R = 0) #GP(0); FI;
BREAK;
PT_SS_FIRST:
PT_SS_REST:
(* SS pages cannot be created on the first or last page of ELRANGE *)
(* Check the enclave offset is within the enclave linear address space *)
IF (TMP_LINADDR < DS:TMP_SECS.BASEADDR or TMP_LINADDR ≥ DS:TMP_SECS.BASEADDR + DS:TMP_SECS.SIZE)
THEN #GP(0); FI;
(* Check if the enclave to which the page will be added is already in Initialized state *)
IF (DS:TMP_SECS already initialized)
THEN #GP(0); FI;
(* associate the EPCPAGE with the SECS by storing the SECS identifier of DS:TMP_SECS *)
Update EPCM(DS:RCX) SECS identifier to reference DS:TMP_SECS identifier;
Flags Affected
None
Description
This leaf function zeroes a page of EPC memory, associates the EPC page with an SECS page residing in the EPC,
and stores the linear address and security attributes in the EPCM. As part of the association, the security attributes
are configured to prevent access to the EPC page until a corresponding invocation of the EACCEPT leaf or EACCEPT-
COPY leaf confirms the addition of the new page into the enclave. This instruction can only be executed when
current privilege level is 0.
RBX contains the effective address of a PAGEINFO structure while RCX contains the effective address of an EPC
page. The table below provides additional information on the memory parameter of the EAUG leaf function.
Concurrency Restrictions
Operation
TMP_SECS ← DS:RBX.SECS;
TMP_SECINFO ← DS:RBX.SECINFO;
IF (DS:RBX.SECINFO is not 0)
THEN
IF (DS:TMP_SECINFO is not 64B aligned)
THEN #GP(0); FI;
FI;
TMP_LINADDR ← DS:RBX.LINADDR;
IF DS:RBX.SRCPAGE is not 0
THEN #GP(0); FI;
IF (EPCM(DS:RCX).VALID ≠ 0)
THEN #PF(DS:RCX); FI;
(* Check if the enclave to which the page will be added is in the Initialized state *)
IF (DS:TMP_SECS is not initialized)
THEN #GP(0); FI;
(* Check the enclave offset is within the enclave linear address space *)
IF ( (TMP_LINADDR < DS:TMP_SECS.BASEADDR) or (TMP_LINADDR ≥ DS:TMP_SECS.BASEADDR + DS:TMP_SECS.SIZE) )
THEN #GP(0); FI;
(* associate the EPCPAGE with the SECS by storing the SECS identifier of DS:TMP_SECS *)
Update EPCM(DS:RCX) SECS identifier to reference DS:TMP_SECS identifier;
Flags Affected
None
Description
This leaf function causes an EPC page to be marked as BLOCKED. This instruction can only be executed when
current privilege level is 0.
The content of RCX is the effective address of an EPC page. The DS segment is used to create the linear address.
Segment override is not supported.
An error code is returned in RAX.
The table below provides additional information on the memory parameter of EBLOCK leaf function.
Concurrency Restrictions
Operation
RFLAGS.ZF,CF,PF,AF,OF,SF ← 0;
RAX ← 0;
IF (EPCM(DS:RCX). VALID = 0)
THEN
RFLAGS.ZF ← 1;
RAX ← SGX_PG_INVLD;
GOTO DONE;
FI;
(* at this point, the page must be valid and PT_TCS or PT_REG or PT_TRIM*)
IF (TMP_BLKSTATE = 1)
THEN
RFLAGS.CF ← 1;
RAX ← SGX_BLKSTATE;
ELSE
EPCM(DS:RCX).BLOCKED ← 1;
FI;
DONE:
Flags Affected
Sets ZF if SECS is in use or invalid, otherwise cleared. Sets CF if page is BLOCKED or not blockable, otherwise
cleared. Clears PF, AF, OF, SF.
Description
ENCLS[ECREATE] is the first instruction executed in the enclave build process. ECREATE copies an SECS structure
outside the EPC into an SECS page inside the EPC. The internal structure of SECS is not accessible to software.
ECREATE will set up fields in the protected SECS and mark the page as valid inside the EPC. ECREATE initializes or
checks unused fields.
Software sets the following fields in the source structure: SECS:BASEADDR, SECS:SIZE in bytes, ATTRIBUTES,
CONFIGID and CONFIGSVN. SECS:BASEADDR must be naturally aligned on an SECS.SIZE boundary. SECS.SIZE
must be at least 2 pages (8192 bytes).
The source operand RBX contains an effective address of a PAGEINFO structure. PAGEINFO contains an effective
address of a source SECS and an effective address of an SECINFO. The SECS field in PAGEINFO is not used.
The RCX register is the effective address of the destination SECS. It is an address of an empty slot in the EPC. The
SECS structure must be page aligned. SECINFO flags must specify the page as an SECS page.
ECREATE will fault if the SECS target page is in use, already valid, or outside the EPC. It will also fault if addresses
are not aligned or if unused PAGEINFO fields are not zero.
If the amount of space needed to store the SSA frame is greater than the amount specified in SECS.SSAFRAMESIZE,
a #GP(0) results. The amount of space needed for an SSA frame is computed based on
DS:TMP_SECS.ATTRIBUTES.XFRM size. Details of computing the size can be found in Section 41.7.
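The #GP(0) condition above can be sketched as a simple capacity check. This is a sketch under stated assumptions: the 184-byte GPR region size and the sample XSAVE sizes are illustrative inputs only; the real requirement is derived from ATTRIBUTES.XFRM and the CPU as described in Section 41.7:

```python
def ssa_frame_fits(ssaframesize_pages: int, xsave_bytes: int,
                   misc_bytes: int = 0, gpr_bytes: int = 184) -> bool:
    """Sketch of the SSA-frame size check: the frame declared in
    SECS.SSAFRAMESIZE (in 4-KByte pages) must hold the XSAVE area,
    the MISC region, and the GPR region. Sizes here are illustrative
    assumptions, not architectural constants."""
    needed = xsave_bytes + misc_bytes + gpr_bytes
    return ssaframesize_pages * 4096 >= needed

# One 4-KByte SSA page holds a small XSAVE area plus the GPR region.
assert ssa_frame_fits(1, xsave_bytes=576)
# A 4096-byte XSAVE area plus the GPR region overflows a single page.
assert not ssa_frame_fits(1, xsave_bytes=4096)
```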
Concurrency Restrictions
Operation
TMP_SRCPGE ← DS:RBX.SRCPGE;
TMP_SECINFO ← DS:RBX.SECINFO;
IF (DS:RBX.LINADDR ≠ 0 or DS:RBX.SECS ≠ 0)
THEN #GP(0); FI;
TMP_SECS ← RCX;
VMCS.Exit_qualification.code ← EPC_PAGE_CONFLICT_EXCEPTION;
VMCS.Exit_qualification.error ← 0;
VMCS.Guest-physical_address ← << translation of DS:TMP_SECS produced by paging >>;
VMCS.Guest-linear_address ← DS:TMP_SECS;
Deliver VMEXIT;
ELSE
#GP(0);
FI;
FI;
IF (EPCM(DS:RCX).VALID = 1)
THEN #PF(DS:RCX); FI;
IF (XFRM is illegal)
THEN #GP(0); FI;
(* Make sure that the SECS does not have any unsupported MISCSELECT options*)
IF ( !(CPUID.(EAX=12H, ECX=0):EBX[31:0] & DS:TMP_SECS.MISCSELECT[31:0]) )
THEN
EPCM(DS:TMP_SECS).EntryLock.Release();
#GP(0);
FI;
(* Compute the size required to save state of the enclave on async exit, see Section 41.7.2.2*)
(* Ensure that the declared area is large enough to hold XSAVE and GPR state *)
IF ( DS:TMP_SECS.SSAFRAMESIZE*4096 < TMP_XSIZE)
THEN #GP(0); FI;
(* Enclave size must be at least 8192 bytes and must be a power of 2 in bytes *)
IF (DS:TMP_SECS.SIZE < 8192 or popcnt(DS:TMP_SECS.SIZE) > 1)
THEN #GP(0); FI;
FI;
TMPUPDATEFIELD[511:160] ← 0;
DS:TMP_SECS.MRENCLAVE ← SHA256UPDATE(DS:TMP_SECS.MRENCLAVE, TMPUPDATEFIELD);
INC enclave’s MRENCLAVE update counter;
(* Set EID *)
DS:TMP_SECS.EID ← LockedXAdd(CR_NEXT_EID, 1);
(* Set the EPCM entry, first create SECS identifier and store the identifier in EPCM *)
EPCM(DS:TMP_SECS).PT ← PT_SECS;
EPCM(DS:TMP_SECS).ENCLAVEADDRESS ← 0;
EPCM(DS:TMP_SECS).R ← 0;
EPCM(DS:TMP_SECS).W ← 0;
EPCM(DS:TMP_SECS).X ← 0;
Flags Affected
None
Description
This leaf function copies a quadword/doubleword from an EPC page belonging to a debug enclave into the RBX
register. Eight bytes are read in 64-bit mode, four bytes are read in non-64-bit modes. The size of data read cannot
be overridden.
The effective address of the source location inside the EPC is provided in the register RCX.
This instruction ignores the EPCM RWX attributes on the enclave page. Consequently, violation of EPCM RWX
attributes via EDBGRD does not result in a #GP.
Concurrency Restrictions
Operation
(* make sure no other Intel SGX instruction is accessing the same EPCM entry *)
IF (Another instruction modifying the same EPCM entry is executing)
THEN #GP(0); FI;
IF (EPCM(DS:RCX).VALID = 0)
THEN #PF(DS:RCX); FI;
(* make sure that DS:RCX (SOURCE) is pointing to a PT_REG or PT_TCS or PT_VA or PT_SS_FIRST or PT_SS_REST *)
IF ( (EPCM(DS:RCX).PT ≠ PT_REG) and (EPCM(DS:RCX).PT ≠ PT_TCS) and (EPCM(DS:RCX).PT ≠ PT_VA)
and (EPCM(DS:RCX).PT ≠ PT_SS_FIRST) and (EPCM(DS:RCX).PT ≠ PT_SS_REST))
THEN #PF(DS:RCX); FI;
RAX ← SGX_PAGE_NOT_DEBUGGABLE;
GOTO DONE;
FI;
(* If source is a TCS, then make sure that the offset into the page is not beyond the TCS size*)
IF ( (EPCM(DS:RCX).PT = PT_TCS) and ((DS:RCX) & FFFH ≥ SGX_TCS_LIMIT) )
THEN #GP(0); FI;
(* make sure the enclave owning the PT_REG or PT_TCS page allows debug *)
IF ( (EPCM(DS:RCX).PT = PT_REG) or (EPCM(DS:RCX).PT = PT_TCS) )
THEN
TMP_SECS ← GET_SECS_ADDRESS;
IF (TMP_SECS.ATTRIBUTES.DEBUG = 0)
THEN #GP(0); FI;
IF (TMP_MODE64 = 1)
THEN RBX[63:0] ← (DS:RCX)[63:0];
ELSE EBX[31:0] ← (DS:RCX)[31:0];
FI;
ELSE
TMP_64BIT_VAL[63:0] ← (DS:RCX)[63:0] & (~07H); // Read contents from VA slot
IF (TMP_MODE64 = 1)
THEN
IF (TMP_64BIT_VAL ≠ 0H)
THEN RBX[63:0] ← 0FFFFFFFFFFFFFFFFH;
ELSE RBX[63:0] ← 0H;
FI;
ELSE
IF (TMP_64BIT_VAL ≠ 0H)
THEN EBX[31:0] ← 0FFFFFFFFH;
ELSE EBX[31:0] ← 0H;
FI;
FI;
DONE:
(* clear flags *)
RFLAGS.CF,PF,AF,OF,SF ← 0;
Flags Affected
ZF is set if the page is MODIFIED or PENDING; RAX contains the error code. Otherwise ZF is cleared and RAX is set
to 0. CF, PF, AF, OF, SF are cleared.
Description
This leaf function copies the content in EBX/RBX to an EPC page belonging to a debug enclave. Eight bytes are
written in 64-bit mode, four bytes are written in non-64-bit modes. The size of data cannot be overridden.
The effective address of the destination location inside the EPC is provided in the register RCX.
This instruction ignores the EPCM RWX attributes on the enclave page. Consequently, violation of EPCM RWX
attributes via EDBGWR does not result in a #GP.
Concurrency Restrictions
Operation
(* make sure no other Intel SGX instruction is accessing the same EPCM entry *)
IF (Another instruction modifying the same EPCM entry is executing)
THEN #GP(0); FI;
IF (EPCM(DS:RCX).VALID = 0)
THEN #PF(DS:RCX); FI;
(* make sure that DS:RCX (DST) is pointing to a PT_REG or PT_TCS or PT_SS_FIRST or PT_SS_REST *)
IF ( (EPCM(DS:RCX).PT ≠ PT_REG) and (EPCM(DS:RCX).PT ≠ PT_TCS)
and (EPCM(DS:RCX).PT ≠ PT_SS_FIRST) and (EPCM(DS:RCX).PT ≠ PT_SS_REST))
THEN #PF(DS:RCX); FI;
RAX ← SGX_PAGE_NOT_DEBUGGABLE;
GOTO DONE;
FI;
(* If destination is a TCS, then make sure that the offset into the page can only point to the FLAGS field*)
IF ( (EPCM(DS:RCX).PT = PT_TCS) and ((DS:RCX) & 0FF8H ≠ offset_of_FLAGS & 0FF8H) )
THEN #GP(0); FI;
(* Locate the SECS for the enclave to which the DS:RCX page belongs *)
TMP_SECS ← GET_SECS_PHYS_ADDRESS(EPCM(DS:RCX).ENCLAVESECS);
(* make sure the enclave owning the PT_REG or PT_TCS page allows debug *)
IF (TMP_SECS.ATTRIBUTES.DEBUG = 0)
THEN #GP(0); FI;
IF (TMP_MODE64 = 1)
THEN (DS:RCX)[63:0] ← RBX[63:0];
ELSE (DS:RCX)[31:0] ← EBX[31:0];
FI;
DONE:
(* clear flags *)
RFLAGS.CF,PF,AF,OF,SF ← 0;
Flags Affected
ZF is set if the page is MODIFIED or PENDING; RAX contains the error code. Otherwise ZF is cleared and RAX is set
to 0. CF, PF, AF, OF, SF are cleared.
Description
This leaf function updates the MRENCLAVE measurement register of an SECS with the measurement of an EXTEND
string comprising “EEXTEND” || ENCLAVEOFFSET || PADDING || 256 bytes of the enclave page. This instruction
can only be executed when the current privilege level is 0 and the enclave is uninitialized.
RBX contains the effective address of the SECS of the region to be measured. The address must be the same as the
one used to add the page into the enclave.
RCX contains the effective address of the 256-byte region of an EPC page to be measured. The DS segment is used
to create linear addresses. Segment override is not supported.
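The measurement update described above can be approximated with a streaming hash. This is a sketch only: the hardware applies raw SHA-256 compression rounds with no final length padding, and the 64-byte header layout shown (8-byte tag, 8-byte offset, 48 zero bytes) is an assumption for illustration, not the architectural encoding:

```python
import hashlib
import struct

def eextend_update(running, enclave_offset: int, chunk: bytes) -> None:
    """Feed one EEXTEND record into a running measurement (sketch).
    `running` is a hashlib SHA-256 object standing in for MRENCLAVE's
    internal state; the header layout is an illustrative assumption."""
    assert len(chunk) == 256
    header = b"EEXTEND\x00" + struct.pack("<Q", enclave_offset) + b"\x00" * 48
    running.update(header)
    running.update(chunk)  # the 256 bytes of the enclave page region

m = hashlib.sha256()
eextend_update(m, 0x3000, bytes(256))
```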
Concurrency Restrictions
Operation
IF (EPCM(DS:RCX).VALID = 0)
THEN #PF(DS:RCX); FI;
(* make sure that DS:RCX (DST) is pointing to a PT_REG or PT_TCS or PT_SS_FIRST or PT_SS_REST *)
IF ( (EPCM(DS:RCX).PT ≠ PT_REG) and (EPCM(DS:RCX).PT ≠ PT_TCS)
and (EPCM(DS:RCX).PT ≠ PT_SS_FIRST) and (EPCM(DS:RCX).PT ≠ PT_SS_REST))
THEN #PF(DS:RCX); FI;
TMP_SECS ← Get_SECS_ADDRESS();
Flags Affected
None
Description
This leaf function is the final instruction executed in the enclave build process. After EINIT, the MRENCLAVE
measurement is complete, and the enclave is ready to start user code execution using the EENTER instruction.
EINIT takes the effective address of a SIGSTRUCT and EINITTOKEN. The SIGSTRUCT describes the enclave
including MRENCLAVE, ATTRIBUTES, ISVSVN, a 3072-bit RSA key, and a signature using the included key.
SIGSTRUCT must be populated with two values, q1 and q2. These are calculated using the formulas shown below:
q1 = floor(Signature^2 / Modulus);
q2 = floor((Signature^3 - q1 * Signature * Modulus) / Modulus);
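The q1/q2 formulas can be checked with arbitrary-precision integers. The identity verified below is what lets hardware evaluate Signature^3 mod Modulus using only multiplies and subtractions; the helper name is hypothetical:

```python
def sigstruct_q1_q2(signature: int, modulus: int) -> tuple:
    """Compute the SIGSTRUCT Q1/Q2 helper values from the formulas
    above (a sketch; real operands are 3072-bit RSA integers)."""
    q1 = signature ** 2 // modulus
    q2 = (signature ** 3 - q1 * signature * modulus) // modulus
    return q1, q2

# Small-number check of the identity
# s^3 = q1*s*m + q2*m + (s^3 mod m), which the formulas guarantee.
sig, mod = 7, 5
q1, q2 = sigstruct_q1_q2(sig, mod)
assert sig ** 3 == q1 * sig * mod + q2 * mod + (sig ** 3 % mod)
```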
The EINITTOKEN contains the MRENCLAVE, MRSIGNER, and ATTRIBUTES. These values must match the corre-
sponding values in the SECS. If the EINITTOKEN was created with a debug launch key, the enclave must be in
debug mode as well.
[Figure 40-1: Relationships during EINIT. The SIGSTRUCT at DS:RBX is verified with its embedded public key; the hashed public key, the ATTRIBUTES (after the ATTRIBUTEMASK check), and MRENCLAVE are checked against the enclave's SECS at DS:RCX; if the EINITTOKEN at DS:RDX has VALID=1, its MRENCLAVE and ATTRIBUTES are also checked, and results are copied into the SECS in the EPC.]
EINIT performs the following steps, which can be seen in Figure 40-1:
1. Validates that SIGSTRUCT is signed using the enclosed public key.
2. Checks that the completed computation of SECS.MRENCLAVE equals SIGSTRUCT.HASHENCLAVE.
3. Checks that no reserved bits are set to 1 in SIGSTRUCT.ATTRIBUTES and no reserved bits in SIGSTRUCT.ATTRIBUTESMASK are set to 0.
4. Checks that no controlled ATTRIBUTES bits are set in SIGSTRUCT.ATTRIBUTES unless the SHA256 digest of SIGSTRUCT.MODULUS equals IA32_SGX_LEPUBKEYHASH.
5. Checks that SIGSTRUCT.ATTRIBUTES equals the result of logically ANDing SIGSTRUCT.ATTRIBUTEMASK with SECS.ATTRIBUTES.
6. If EINITTOKEN.VALID is 0, checks that the SHA256 digest of SIGSTRUCT.MODULUS equals IA32_SGX_LEPUBKEYHASH.
7. If EINITTOKEN.VALID is 1, checks the validity of EINITTOKEN.
8. If EINITTOKEN.VALID is 1, checks that EINITTOKEN.MRENCLAVE equals SECS.MRENCLAVE.
9. If EINITTOKEN.VALID is 1 and EINITTOKEN.ATTRIBUTES.DEBUG is 1, SECS.ATTRIBUTES.DEBUG must be 1.
10. Commits SECS.MRENCLAVE, and sets SECS.MRSIGNER, SECS.ISVSVN, and SECS.ISVPRODID based on SIGSTRUCT.
11. Updates the SECS as Initialized.
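The ATTRIBUTES/ATTRIBUTEMASK comparison above can be sketched as a plain bitmask equality. This is a sketch only; field widths are abstracted to Python integers, and the helper name is hypothetical:

```python
def einit_attribute_check(secs_attributes: int, sig_attributes: int,
                          sig_attributemask: int) -> bool:
    """Sketch of EINIT's attribute check: SIGSTRUCT.ATTRIBUTES must
    equal SIGSTRUCT.ATTRIBUTEMASK ANDed with SECS.ATTRIBUTES. The
    mask selects which enclave attribute bits the signer constrains."""
    return sig_attributes == (sig_attributemask & secs_attributes)

# The masked SECS bits match what the signer asserted.
assert einit_attribute_check(0b1011, 0b0011, 0b0111)
# A narrower mask exposes a mismatch, so EINIT would fail.
assert not einit_attribute_check(0b1011, 0b0011, 0b0001)
```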
Periodically, EINIT polls for certain asynchronous events. If such an event is detected, it completes with failure
code (ZF=1 and RAX = SGX_UNMASKED_EVENT), and RIP is incremented to point to the next instruction. These
events include external interrupts, non-maskable interrupts, system-management interrupts, machine checks,
INIT signals, and the VMX-preemption timer. EINIT does not fail if the pending event is inhibited (e.g., external
interrupts could be inhibited due to blocking by MOV SS or by STI).
The following bits in RFLAGS are cleared: CF, PF, AF, OF, and SF. When the instruction completes with an error,
RFLAGS.ZF is set to 1, and the corresponding error bit is set in RAX. If no error occurs, RFLAGS.ZF is cleared and
RAX is set to 0.
The error codes are:
Concurrency Restrictions
Operation
(* Open “Event Window” Check for Interrupts. Verify signature using embedded public key, q1, and q2. Save upper 352 bytes of the
PKCS1.5 encoded message into the TMP_SIG_PADDING*)
IF (interrupt was pending) THEN
RFLAGS.ZF ← 1;
RAX ← SGX_UNMASKED_EVENT;
GOTO EXIT;
FI;
IF (signature failed to verify) THEN
RFLAGS.ZF ← 1;
RAX ← SGX_INVALID_SIGNATURE;
GOTO EXIT;
FI;
(*Close “Event Window” *)
TMP_MRSIGNER ← SHA256(TMP_SIG.MODULUS);
(* if controlled ATTRIBUTES are set, SIGSTRUCT must be signed using an authorized key *)
CONTROLLED_ATTRIBUTES ← 0000000000000020H;
IF ( ((DS:RCX.ATTRIBUTES & CONTROLLED_ATTRIBUTES) ≠ 0) and (TMP_MRSIGNER ≠ IA32_SGXLEPUBKEYHASH) )
THEN
RFLAGS.ZF ← 1;
RAX ← SGX_INVALID_ATTRIBUTE;
GOTO EXIT;
FI;
IF (CPUID.(EAX=12H, ECX=1):EAX[6] = 1)
THEN
IF (DS:RCX.CET_ATTRIBUTES & TMP_SIG.CET_ATTRIBUTES_MASK ≠ TMP_SIG.CET_ATTRIBUTES & TMP_SIG.CET_ATTRIBUTES_MASK)
THEN
RFLAGS.ZF ← 1;
RAX ← SGX_INVALID_ATTRIBUTE;
GOTO EXIT;
FI;
FI;
40-50 Vol. 3D
SGX INSTRUCTION REFERENCES
(* Check that reserved space in the EINIT token is clear; this includes reserved regions and upper bits in the valid field *)
IF (TMP_TOKEN reserved space is not clear)
THEN
RFLAGS.ZF ← 1;
RAX ← SGX_INVALID_EINITTOKEN;
GOTO EXIT;
FI;
(* EINIT token must not have been created by a configuration beyond the current CPU configuration *)
IF (TMP_TOKEN.CPUSVN is a configuration beyond CR_CPUSVN)
THEN
RFLAGS.ZF ← 1;
RAX ← SGX_INVALID_CPUSVN;
GOTO EXIT;
FI;
TMP_KEYDEPENDENCIES.KEYNAME ← EINITTOKEN_KEY;
TMP_KEYDEPENDENCIES.ISVFAMILYID ← 0;
TMP_KEYDEPENDENCIES.ISVEXTPRODID ← 0;
TMP_KEYDEPENDENCIES.ISVPRODID ← TMP_TOKEN.ISVPRODIDLE;
TMP_KEYDEPENDENCIES.ISVSVN ← TMP_TOKEN.ISVSVN;
TMP_KEYDEPENDENCIES.SGXOWNEREPOCH ← CR_SGXOWNEREPOCH;
TMP_KEYDEPENDENCIES.ATTRIBUTES ← TMP_TOKEN.MASKEDATTRIBUTESLE;
TMP_KEYDEPENDENCIES.ATTRIBUTESMASK ← 0;
TMP_KEYDEPENDENCIES.MRENCLAVE ← 0;
TMP_KEYDEPENDENCIES.MRSIGNER ← IA32_SGXLEPUBKEYHASH;
TMP_KEYDEPENDENCIES.KEYID ← TMP_TOKEN.KEYID;
TMP_KEYDEPENDENCIES.SEAL_KEY_FUSES ← CR_SEAL_FUSES;
TMP_KEYDEPENDENCIES.CPUSVN ← TMP_TOKEN.CPUSVN;
TMP_KEYDEPENDENCIES.MISCSELECT ← TMP_TOKEN.MASKEDMISCSELECTLE;
TMP_KEYDEPENDENCIES.MISCMASK ← 0;
TMP_KEYDEPENDENCIES.PADDING ← HARDCODED_PKCS1_5_PADDING;
TMP_KEYDEPENDENCIES.KEYPOLICY ← 0;
TMP_KEYDEPENDENCIES.CONFIGID ← 0;
TMP_KEYDEPENDENCIES.CONFIGSVN ← 0;
IF (CPUID.(EAX=12H, ECX=1):EAX[6] = 1)
THEN
TMP_KEYDEPENDENCIES.CET_ATTRIBUTES ← TMP_TOKEN.CET_MASKED_ATTRIBUTES_LE;
TMP_KEYDEPENDENCIES.CET_ATTRIBUTES_MASK ← 0;
FI;
(* Verify EINITTOKEN was generated using this CPU's Launch key and that it has not been modified since being issued by the Launch
Enclave. Only 192 bytes of EINITTOKEN are CMACed *)
IF (TMP_TOKEN.MAC ≠ CMAC(TMP_EINITTOKENKEY, TMP_TOKEN[1535:0]))
THEN
RFLAGS.ZF ← 1;
RAX ← SGX_INVALID_EINITTOKEN;
GOTO EXIT;
FI;
COMMIT:
(* Commit changes to the SECS; set ISVPRODID, ISVSVN, MRSIGNER, INIT ATTRIBUTE fields in SECS (RCX) *)
DS:RCX.MRENCLAVE ← TMP_MRENCLAVE;
(* MRSIGNER stores a SHA256 in little endian implemented natively on x86 *)
DS:RCX.MRSIGNER ← TMP_MRSIGNER;
DS:RCX.ISVEXTPRODID ← TMP_SIG.ISVEXTPRODID;
DS:RCX.ISVPRODID ← TMP_SIG.ISVPRODID;
DS:RCX.ISVSVN ← TMP_SIG.ISVSVN;
DS:RCX.ISVFAMILYID ← TMP_SIG.ISVFAMILYID;
DS:RCX.PADDING ← TMP_SIG_PADDING;
Flags Affected
ZF is cleared if successful, otherwise ZF is set and RAX contains the error code. CF, PF, AF, OF, SF are cleared.
Description
This leaf function copies a page from regular main memory to the EPC. As part of the copying process, the page is
cryptographically authenticated and decrypted. This instruction can only be executed when the current privilege
level is 0.
The ELDB leaf function sets the BLOCK bit in the EPCM entry for the destination page in the EPC after copying. The
ELDU leaf function clears the BLOCK bit in the EPCM entry for the destination page in the EPC after copying.
RBX contains the effective address of a PAGEINFO structure; RCX contains the effective address of the destination
EPC page; RDX holds the effective address of the version array slot that holds the version of the page.
The ELDBC/ELDUC leaf functions are very similar to ELDB and ELDU; they provide an error code on a concurrency
conflict for any of the pages that need to acquire a lock, including the destination EPC page, the SECS, and the VA slot.
The table below provides additional information on the memory parameter of the ELDB/ELDU leaf functions.
Concurrency Restrictions
Operation
TMP_SRCPGE ← DS:RBX.SRCPGE;
TMP_SECS ← DS:RBX.SECS;
TMP_PCMD ← DS:RBX.PCMD;
(* Check alignment of PAGEINFO (RBX) linked parameters. Note: PCMD pointer is overlaid on top of PAGEINFO.SECINFO field *)
IF ( (DS:TMP_PCMD is not 128-byte aligned) or (DS:TMP_SRCPGE is not 4-KByte aligned) )
THEN #GP(0); FI;
FI;
FI;
TMP_HEADER.SECINFO ← SCRATCH_PCMD.SECINFO;
TMP_HEADER.RSVD ← SCRATCH_PCMD.RSVD;
TMP_HEADER.LINADDR ← DS:RBX.LINADDR;
FI;
(* Decrypt and MAC page. AES_GCM_DEC has 2 outputs, {plain text, MAC} *)
(* Parameters for AES_GCM_DEC {Key, Counter, ..} *)
{DS:RCX, TMP_MAC} ← AES_GCM_DEC(CR_BASE_PK, TMP_VER << 32, TMP_HEADER, 128, DS:RCX, 4096);
IF (TMP_MAC ≠ DS:TMP_PCMD.MAC)
THEN
RFLAGS.ZF ← 1;
RAX ← SGX_MAC_COMPARE_FAIL;
GOTO ERROR_EXIT;
FI;
IF (TMP_HEADER.SECINFO.FLAGS.PT is PT_SECS)
<< store translation of DS:RCX produced by paging in SECS(DS:RCX).ENCLAVECONTEXT >>
FI;
EPCM(DS:RCX).VALID ← 1;
RAX ← 0;
RFLAGS.ZF ← 0;
ERROR_EXIT:
RFLAGS.CF,PF,AF,OF,SF ← 0;
Flags Affected
Sets ZF if unsuccessful; otherwise ZF is cleared. RAX returns the error code. Clears CF, PF, AF, OF, SF.
Description
This leaf function restricts the access rights associated with an EPC page in an initialized enclave. The RWX bits of
the SECINFO parameter are treated as a permissions mask; supplying a value that does not restrict the page
permissions will have no effect. This instruction can only be executed when the current privilege level is 0.
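The restriction-only semantics described above can be sketched as a bitwise AND of the current permissions with the SECINFO mask. This is a sketch for illustration; the bit encoding (bit 0 = R, bit 1 = W, bit 2 = X) and the helper name are assumptions, not the EPCM layout:

```python
def emodpr_effective_perms(current_rwx: int, secinfo_rwx: int) -> int:
    """Sketch of EMODPR: the SECINFO RWX bits act as a mask, so a
    page's permissions can only stay the same or be reduced, never
    expanded."""
    return current_rwx & secinfo_rwx

# Dropping W from an RW page leaves R only.
assert emodpr_effective_perms(0b011, 0b101) == 0b001
# A fully permissive mask restricts nothing, so it has no effect.
assert emodpr_effective_perms(0b011, 0b111) == 0b011
```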
RBX contains the effective address of a SECINFO structure while RCX contains the effective address of an EPC page.
The table below provides additional information on the memory parameter of the EMODPR leaf function.
Concurrency Restrictions
Operation
SCRATCH_SECINFO ← DS:RBX;
IF (EPCM(DS:RCX).VALID is 0)
THEN #PF(DS:RCX); FI;
GOTO DONE;
FI;
TMP_SECS ← GET_SECS_ADDRESS;
IF (TMP_SECS.ATTRIBUTES.INIT = 0)
THEN #GP(0); FI;
RFLAGS.ZF ← 0;
RAX ← 0;
DONE:
RFLAGS.CF,PF,AF,OF,SF ← 0;
Flags Affected
Sets ZF if the page is not modifiable or if other SGX2 instructions are executing concurrently; otherwise ZF is
cleared. Clears CF, PF, AF, OF, SF.
Description
This leaf function modifies the type of an EPC page. The security attributes are configured to prevent access to the
EPC page at its new type until a corresponding invocation of the EACCEPT leaf confirms the modification. This
instruction can only be executed when the current privilege level is 0.
RBX contains the effective address of a SECINFO structure while RCX contains the effective address of an EPC
page. The table below provides additional information on the memory parameter of the EMODT leaf function.
Concurrency Restrictions
Operation
SCRATCH_SECINFO ← DS:RBX;
IF (EPCM(DS:RCX).VALID is 0)
THEN #PF(DS:RCX); FI;
FI;
IF (!(EPCM(DS:RCX).PT is PT_REG or
((EPCM(DS:RCX).PT is PT_TCS or PT_SS_FIRST or PT_SS_REST) and SCRATCH_SECINFO.FLAGS.PT is PT_TRIM)))
THEN #PF(DS:RCX); FI;
TMP_SECS ← GET_SECS_ADDRESS;
IF (TMP_SECS.ATTRIBUTES.INIT = 0)
THEN #GP(0); FI;
RFLAGS.ZF ← 0;
RAX ← 0;
DONE:
RFLAGS.CF,PF,AF,OF,SF ← 0;
Flags Affected
Sets ZF if the page is not modifiable or if other SGX2 instructions are executing concurrently; otherwise ZF is
cleared. Clears CF, PF, AF, OF, SF.