
Contents

1. Introduction
2. Architectural Features
2.1 The Pipeline
2.2 The Processor Functionality
2.2.1 The Instruction Processing Block
2.2.2 The Execution Block
2.2.3 The Control Block
2.2.4 The Memory Subsystem
2.2.5 The IA-32 Execution Engine
3. Cache Structure
References

Intel Itanium 2
1. Introduction
The Itanium processor is the first implementation of the IA-64 (Itanium Architecture) instruction set architecture (ISA). The processor is designed to meet a wide range of requirements: high performance on Internet servers and workstations, support for 64-bit addressing, reliability for mission-critical applications, full IA-32 instruction set compatibility in hardware (for backward compatibility), and scalability across a range of operating systems and platforms. The multi-threading and multi-core features it incorporates give it higher performance, greater throughput, and a favorable price-to-performance ratio.

The features behind the performance of the Intel Itanium 2 are:

- It employs EPIC (Explicitly Parallel Instruction Computing) to scale performance by increasing instruction-level parallelism (ILP). Current Itanium processors can execute up to 6 instructions in parallel.
- It has a 6-wide, 8-stage-deep pipeline running at 1.5 GHz.
- Execution resources: 6 integer and 6 multimedia ALUs (Arithmetic and Logic Units), 2 load and 2 store units, 3 branch units, 2 extended-precision and 2 single-precision FP units.
- Registers: 128 integer registers, 128 FP registers, 8 branch registers, and 64 predicate registers.
- It can fetch, issue, execute, and retire 6 instructions (2 bundles) per clock.
- Three levels of on-die cache minimize memory latency.
- It uses dynamic prefetch, branch prediction, and a register scoreboard.
- The system bus supports multiprocessing (MP) with up to 4 processors per bus, so the processor can serve as a building block.

The pipeline of the Itanium 2 processor can execute up to six instructions simultaneously (i.e. it is 6-wide) and has eight stages. The highest clock rate currently available is 1.5 GHz, but there are also 1.4 GHz, 1.3 GHz and 1 GHz processors. The Itanium 2 processor is designed to perform highly parallel computing, and to accomplish that it is built with a set of execution resources that includes 6 integer and 6 multimedia ALUs, 2 load and 2 store units, 3 branch units, and 2 extended-precision and 2 single-precision floating-point units. The Itanium 2 processor can fetch, issue, execute and retire 6 instructions (2 bundles) per clock cycle. It features a three-level cache hierarchy to minimize memory latency. The processor makes use of dynamic prefetching, branch prediction and register scoreboarding.
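The relationship between bundles and instructions can be made concrete with a small sketch. It assumes the standard IA-64 bundle layout (128 bits: a 5-bit template in the low bits followed by three 41-bit instruction slots), which this text does not spell out; the function names are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

TEMPLATE_BITS = 5   # assumption: IA-64 bundle layout, not stated in this text
SLOT_BITS = 41      # three slots of 41 bits each: 5 + 3 * 41 = 128
SLOTS_PER_BUNDLE = 3


def make_bundle(template: int, slots: List[int]) -> int:
    """Pack a template and three slot values into one 128-bit integer."""
    value = template
    for i, slot in enumerate(slots):
        value |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return value


def split_bundle(bundle: int) -> Tuple[int, List[int]]:
    """Return (template, [slot0, slot1, slot2]) for a 128-bit bundle value."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


def peak_instructions_per_second(bundles_per_clock: int, clock_hz: int) -> int:
    """Peak rate: bundles per clock times three instructions per bundle."""
    return bundles_per_clock * SLOTS_PER_BUNDLE * clock_hz
```

With 2 bundles per clock at 1.5 GHz, peak_instructions_per_second(2, 1_500_000_000) gives the theoretical upper bound of 9 billion instructions per second; sustained rates depend on how well the compiler fills the slots.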

2. Architectural Features
The key architectural features of the Intel Itanium 2 are instruction-level parallelism, a large number of registers, speculation, predication, a register stack, software pipelining, advanced floating point, and large caches.

2.1 The Pipeline

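[Figure: Itanium 2 core pipeline. Stages: Instruction Fetch (IPG), Instruction Rotation (ROT), Expand (EXP), Register Rename (REN), Register Read (REG), Execute (EXE), Exception Detection (DET), Write Back (WRB)]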

The figure shows the core pipeline of the Itanium 2 processor. The pipeline has eight stages and can execute up to six instructions (two bundles) in parallel per clock cycle. The first two stages handle instruction fetch (IPG) and, during the rotation stage (ROT), the placement of instructions into a decoupling buffer. This buffer separates the front end of the pipeline from the back end so that the two can work independently; the bold line between the ROT and EXP stages in the figure marks the point of decoupling. The process of bringing new bundles into the two-bundle issue window is called bundle rotation. A bundle rotation occurs once all the instructions within a bundle have been issued.
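The effect of the decoupling buffer can be illustrated with a toy cycle-by-cycle model. This is only a sketch under stated assumptions: the buffer holds 8 bundles (the figure given in section 2.2.1), the front end fetches up to 2 bundles per cycle, the back end issues up to 2 bundles per cycle, and the function name simulate is hypothetical.

```python
from collections import deque
from typing import List

BUFFER_CAPACITY = 8  # bundles (value from section 2.2.1)
FETCH_WIDTH = 2      # bundles fetched per cycle when the front end is ready
ISSUE_WIDTH = 2      # bundles issued per cycle when the back end is ready


def simulate(front_end_ready: List[bool], back_end_ready: List[bool]) -> List[int]:
    """Return the number of bundles issued in each cycle.

    front_end_ready[i] is False when fetch stalls in cycle i (for example
    on an instruction cache miss); back_end_ready[i] is False when the
    execution side stalls in cycle i.
    """
    buffer = deque()
    next_bundle = 0
    issued_per_cycle = []
    for fetch_ok, issue_ok in zip(front_end_ready, back_end_ready):
        # The back end drains bundles that were buffered in earlier cycles.
        issued = 0
        while issue_ok and buffer and issued < ISSUE_WIDTH:
            buffer.popleft()
            issued += 1
        issued_per_cycle.append(issued)
        # The front end refills the buffer independently of the back end.
        for _ in range(FETCH_WIDTH if fetch_ok else 0):
            if len(buffer) < BUFFER_CAPACITY:
                buffer.append(next_bundle)
                next_bundle += 1
    return issued_per_cycle
```

In this model, bundles fetched while the back end is stalled accumulate in the buffer, so the back end can keep issuing for a few cycles after the front end itself stalls.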

The next two stages perform dispersal, or expansion (EXP), and register renaming (REN). Dispersal is the process of mapping the instructions within bundles to functional units. Each instruction is dispersed into one of the execution pipelines based on its type. The instruction types in Itanium are: ALU integer (A), non-ALU integer (I), memory (M), floating-point (F), branch (B), and extended (L+X). Operand access is done during the register read (REG) stage: the register file is accessed and data is sent through the bypass network once the predicate control has executed.
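Dispersal can be sketched as a simple in-order port assignment. The port table below (four memory ports, two integer ports, two floating-point ports, three branch ports, with A-type instructions allowed on either integer or memory ports) is an illustrative assumption and is not taken from this text; the names PORTS, CANDIDATES and disperse are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed issue ports per class (illustrative only).
PORTS: Dict[str, List[str]] = {
    "M": ["M0", "M1", "M2", "M3"],
    "I": ["I0", "I1"],
    "F": ["F0", "F1"],
    "B": ["B0", "B1", "B2"],
}

# Port classes each instruction type may use, in order of preference.
CANDIDATES: Dict[str, Tuple[str, ...]] = {
    "A": ("I", "M"),  # ALU integer ops can go to an integer or memory port
    "I": ("I",),
    "M": ("M",),
    "F": ("F",),
    "B": ("B",),
}


def disperse(instruction_types: List[str]) -> List[Tuple[str, str]]:
    """Assign instructions to free ports in program order.

    Dispersal stops at the first instruction that finds no free port,
    which models in-order issue within the issue window.
    """
    free = {cls: list(ports) for cls, ports in PORTS.items()}
    assignment = []
    for itype in instruction_types:
        for cls in CANDIDATES[itype]:
            if free[cls]:
                assignment.append((itype, free[cls].pop(0)))
                break
        else:
            break  # no port available: the remaining instructions wait
    return assignment
```

For example, disperse(["I", "I", "I"]) assigns only the first two instructions under this port table, because the third I-type instruction finds no free integer port and must wait for the next cycle.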

The last three stages perform parallel execution (EXE), exception detection (DET), and retirement (write back, WRB). The DET stage includes branch resolution, memory exception management, and speculation support.

2.2 The Processor Functionality


The following figure shows the block diagram of the Intel Itanium 2 processor.

[Figure: block diagram with five blocks: Control Block, Instruction Processing, IA-32 Execution Engine, Memory Subsystem, Execution Block]

The processor's functionality is divided into five groups: Instruction Processing, Execution, Control, Memory Subsystem, and IA-32 Execution Engine.

2.2.1 The Instruction Processing Block

The Instruction Processing Block contains the logic for instruction prefetch, instruction fetch, L1 instruction cache, branch prediction, instruction address generation, instruction buffers, instruction issue, dispersal and rename.

Instruction prefetch is the process of moving instruction cache lines from higher levels of cache or memory into the L1 instruction cache. The processor performs speculative instruction prefetches based on a complex branch prediction strategy and on hints specified by the compiler. The processor reads two instruction bundles from the L1 instruction cache and puts them in the 8-bundle instruction buffer. They stay in the instruction buffer until they are assigned to one of the pipelined functional units. Once the instructions are read from the instruction buffer, they go to the instruction issue and renaming logic.

The instruction address generator unit selects the next instruction pointer (IP) from among: the next sequential address, static and dynamic branch prediction addresses, addresses from the compatibility logic engine, the validated address used to correct a wrongly predicted branch, and the address from exception handlers.
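The selection of the next IP can be pictured as a priority multiplexer. The text lists the candidate sources but not their priority, so the ordering below (exception handler first, then misprediction correction, then the compatibility engine, then a predicted branch target, then the sequential address) is purely an assumption for illustration, as are the function and parameter names.

```python
from typing import Optional


def select_next_ip(
    sequential: int,
    predicted: Optional[int] = None,
    compatibility: Optional[int] = None,
    mispredict_fix: Optional[int] = None,
    exception: Optional[int] = None,
) -> int:
    """Pick the next instruction pointer from the available candidates.

    A candidate of None means that source has nothing to offer this cycle.
    The priority order is an assumed one, not documented behavior.
    """
    for candidate in (exception, mispredict_fix, compatibility, predicted):
        if candidate is not None:
            return candidate
    return sequential
```

With no redirect pending, the function falls through to the sequential address; a pending exception handler address overrides every other source in this sketch.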

2.2.2 The Execution Block

The Execution Block has the multimedia logic, integer ALU execution logic, floating-point execution logic, integer register file, L1 data cache, and FP register file.

After instructions are processed, they are executed in the subsequent pipeline stages. The Itanium 2 processor execution logic contains abundant resources to sustain a high level of parallelism. These execution resources include 6 multimedia functional units, 6 integer functional units, 4 load/store units, 3 branch units, and 2 floating-point functional units. The Itanium 2 also provides a large register file with 128 integer registers, 128 floating-point registers, 8 branch registers, and 64 predicate registers. Only integer loads are processed by the L1 cache. Integer stores, as well as floating-point loads and stores, are served by the L2 cache. Every lookup in L1 generates a speculative request to the L2 cache.

The multimedia engines treat 64-bit data as one of the following packed structures: two 32-bit values, four 16-bit values, or eight 8-bit values. Three classes of operations can be performed on the packed (SIMD, Single Instruction Multiple Data) data types: arithmetic, shift, and data arrangement. The integer engines, on the other hand, support up to six non-packed integer arithmetic and logical operations. At most six integer or multimedia operations can be executed each cycle.
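A packed addition on a 64-bit value can be sketched as follows. The sketch assumes wraparound (modular) arithmetic in each lane, so a carry out of one lane never reaches its neighbor; the helper names are hypothetical and the code models the idea rather than any specific Itanium instruction.

```python
from typing import List

WORD_BITS = 64


def split_lanes(value: int, lane_bits: int) -> List[int]:
    """Split a 64-bit value into lanes, lowest lane first."""
    mask = (1 << lane_bits) - 1
    return [(value >> (i * lane_bits)) & mask for i in range(WORD_BITS // lane_bits)]


def join_lanes(lanes: List[int], lane_bits: int) -> int:
    """Reassemble lanes (lowest first) into a single 64-bit value."""
    mask = (1 << lane_bits) - 1
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & mask) << (i * lane_bits)
    return value


def packed_add(a: int, b: int, lane_bits: int) -> int:
    """Add two 64-bit values lane by lane with wraparound in each lane."""
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane_bits must be 8, 16 or 32")
    mask = (1 << lane_bits) - 1
    sums = [
        (x + y) & mask
        for x, y in zip(split_lanes(a, lane_bits), split_lanes(b, lane_bits))
    ]
    return join_lanes(sums, lane_bits)
```

With 16-bit lanes, adding 0x0001000100010001 to 0x000100FFFFFF0010 yields 0x0002010000000011: the lane holding 0xFFFF wraps to 0x0000 without disturbing the lane above it.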

2.2.3 The Control Block

The Control Block contains the exception handler, pipeline control (scoreboard, predication), and register stack engine (RSE).

The control logic is formed by the exception handler and the pipeline control. The exception handler implements exception prioritizing. The pipeline control has a scoreboard that detects dependencies, supports data speculation, and tracks multicycle operations such as first-level data cache (L1D) misses, multimedia operations, and floating-point operations. The pipeline control supports predication via predicate registers. It also contains a performance monitoring unit designed to collect data for analyzing processor performance.
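The two pipeline-control ideas mentioned here, a scoreboard that holds back dependent instructions and predication that gates a result on a predicate register, can be sketched in a few lines. This is a simplified model under assumptions: latencies are counted in whole cycles, registers are identified by plain strings, and the class and function names are hypothetical.

```python
from typing import Dict, Iterable


class Scoreboard:
    """Tracks registers whose values are still being produced."""

    def __init__(self) -> None:
        self.pending: Dict[str, int] = {}  # register -> cycles until ready

    def issue(self, dest: str, latency: int) -> None:
        """Record that dest will be written after `latency` cycles."""
        self.pending[dest] = latency

    def can_issue(self, sources: Iterable[str]) -> bool:
        """An instruction may issue only if none of its sources is pending."""
        return all(reg not in self.pending for reg in sources)

    def tick(self) -> None:
        """Advance one cycle and release registers that became ready."""
        self.pending = {r: n - 1 for r, n in self.pending.items() if n > 1}


def commit_if(predicates: Dict[str, bool], qp: str,
              registers: Dict[str, int], dest: str, value: int) -> None:
    """Predication: write the result only when the qualifying predicate is true."""
    if predicates[qp]:
        registers[dest] = value
```

A load into r4 with a two-cycle latency, for instance, blocks any consumer of r4 until tick() has been called twice, while instructions that read other registers can still issue.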

2.2.4 The Memory Subsystem

The Memory Subsystem includes the unified L2 cache, on-chip L3 cache, interrupt controller, instruction and data translation lookaside buffers (ITLB and DTLB), advanced load address table (ALAT), and external system bus interface logic.

The L1 instruction cache size is 16 KB. It is a non-blocking dual-ported 4-way set-associative cache with 64-byte line size. One port is for instruction fetches and the other port is for prefetches, snoops, fills, and column invalidates. L1i cache is physically indexed and tagged. It is fully pipelined and can deliver 2 instruction bundles per clock cycle.

The L1 data cache size is also 16 KB. It is a non-blocking 4-way set-associative cache with a 64-byte line size and 4 ports (2 for reads and 2 for writes). L1d only holds integer data. It is write-through and uses a no-write-allocate strategy. L1d cache is physically indexed and tagged.

The 256 KB unified L2 cache is four-ported, non-blocking, and 8-way set-associative with a 128-byte line size. L2 read bandwidth is 64 GB/sec. It is a write-back cache with a write-allocate policy. It is physically indexed and tagged. The L2 cache handles all floating-point memory accesses (max. 4 FP loads per clock) and semaphore instructions.
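The sizes, associativities and line sizes quoted for the L1 and L2 caches determine how many sets each cache has and how an address splits into tag, set index and line offset. The sketch below works through that arithmetic; the function names are hypothetical, and it assumes the conventional layout in which the offset occupies the lowest address bits and the index the bits just above it.

```python
from typing import Dict, Tuple


def cache_geometry(size_bytes: int, ways: int, line_bytes: int) -> Dict[str, int]:
    """Derive the set count and address field widths for a cache."""
    sets = size_bytes // (ways * line_bytes)
    return {
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2 of the line size
        "index_bits": sets.bit_length() - 1,         # log2 of the set count
    }


def split_address(address: int, geometry: Dict[str, int]) -> Tuple[int, int, int]:
    """Return (tag, set index, line offset) for a physical address."""
    offset = address & ((1 << geometry["offset_bits"]) - 1)
    index = (address >> geometry["offset_bits"]) & (geometry["sets"] - 1)
    tag = address >> (geometry["offset_bits"] + geometry["index_bits"])
    return tag, index, offset


# Parameters taken from the text above.
L1D = cache_geometry(16 * 1024, ways=4, line_bytes=64)    # 64 sets
L2 = cache_geometry(256 * 1024, ways=8, line_bytes=128)   # 256 sets
```

So a 16 KB, 4-way cache with 64-byte lines has 16384 / (4 x 64) = 64 sets, and the 256 KB, 8-way L2 with 128-byte lines has 262144 / (8 x 128) = 256 sets.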

The on-chip L3 cache is 1.5 MB or 3 MB in size (the most recent processor actually has a 6 MB L3 cache). It is non-blocking, single-ported, and fully pipelined. The L3 cache is 12-way set-associative with a 128-byte line size. It is physically indexed and tagged. The maximum transfer rate from L3 to L1 or L2 is 32 GB/sec. It protects both tag and data with single-bit-correction, double-bit-detection ECC.

The ALAT is a cache structure that enables data speculation. The TLB holds virtual-to-physical address mappings.
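How the ALAT supports data speculation can be shown with a minimal model: an advanced load records its target register and address, a later store to the same address removes the entry, and a check succeeds only if the entry is still present. The sketch assumes exact address matching and unlimited capacity, which a real ALAT does not have; the class and method names are hypothetical.

```python
from typing import Dict


class Alat:
    """Toy advanced load address table keyed by destination register."""

    def __init__(self) -> None:
        self.entries: Dict[str, int] = {}  # register -> address it was loaded from

    def advanced_load(self, register: str, address: int) -> None:
        """Record a load that was moved ahead of possibly conflicting stores."""
        self.entries[register] = address

    def store(self, address: int) -> None:
        """A store invalidates every entry that refers to the same address."""
        self.entries = {r: a for r, a in self.entries.items() if a != address}

    def check(self, register: str) -> bool:
        """True if the speculative value is still valid; False means reload."""
        return register in self.entries
```

If a store hits the recorded address between the advanced load and the check, check() returns False and the program must redo the load; otherwise the early-loaded value can be used safely.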

2.2.5 The IA-32 Execution Engine

The IA-32 execution engine fetches, decodes, and schedules IA-32 instructions, providing backward compatibility.

3. Cache Structure
The cache is very high-speed memory for data that gets reused. All the caches are organized into cache lines, because accessing a single element brings in enough adjacent elements to fill the line (64 or 128 consecutive bytes), and because of the underlying assumption that if you need one element you will soon need its neighbors. Cache lines are organized into associative sets, or ways, which gives the hardware more flexibility in choosing which line to replace.
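The interplay of cache lines, sets and ways can be seen in a small simulator. The text does not name the replacement policy, so least-recently-used (LRU) replacement is assumed here for illustration; the class name and parameters are hypothetical.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Minimal set-associative cache model with assumed LRU replacement."""

    def __init__(self, num_sets: int, ways: int, line_bytes: int) -> None:
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # One ordered dict per set: keys are tags, oldest entry first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, address: int) -> bool:
        """Touch an address; return True on a hit and False on a miss."""
        line = address // self.line_bytes   # which cache line the byte is in
        index = line % self.num_sets        # which set that line maps to
        tag = line // self.num_sets         # identifies the line within the set
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)           # mark as most recently used
            return True
        if len(ways) >= self.ways:
            ways.popitem(last=False)        # evict the least recently used line
        ways[tag] = None
        return False
```

With 64 sets, 4 ways and 64-byte lines (the L1 data cache parameters), a first access to address 0 misses, but addresses 8 and 63 then hit because they share the same line. Five lines that map to the same set cannot all stay resident, so one of them is evicted.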

[Figure: L1D cache structure, showing four 64-byte cache lines]

[Figure: L2 unified cache bank structure, showing cache lines 1 of 8 through 8 of 8 spread across 16 banks, each 16 bytes wide]
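The banked L2 organization (16 banks, each 16 bytes wide) can be illustrated by computing which bank an address falls in. The mapping below assumes simple low-order interleaving, where consecutive 16-byte chunks go to consecutive banks; the text does not confirm the actual mapping, and the function names are hypothetical.

```python
from typing import Iterable

BANKS = 16            # from the figure
BANK_WIDTH_BYTES = 16  # from the figure


def bank_of(address: int) -> int:
    """Bank index under assumed low-order interleaving."""
    return (address // BANK_WIDTH_BYTES) % BANKS


def has_bank_conflict(addresses: Iterable[int]) -> bool:
    """True if two accesses issued in the same cycle target the same bank."""
    banks = [bank_of(a) for a in addresses]
    return len(set(banks)) < len(banks)
```

Under this assumption, 16 banks of 16 bytes cover 256 bytes, so addresses 0 and 256 land in the same bank and would conflict if accessed in the same cycle, while addresses 0 and 16 would not.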



References
1. Itanium Processor Microarchitecture (PDF). https://tuubi.metropolia.fi/portal/en/group/tuubi
2. Intel Itanium Architecture. http://www.gelato.org/pdf/Illinois/gelato_IL2004_architecture_moore.pdf
3. Intel Itanium Architecture Software Developer's Manual. http://www.intel.com/Assets/PDF/manual/324091.pdf
