Pentium


Features of Pentium

The Pentium was introduced in 1993 with clock frequencies ranging from 60 to 66 MHz. The primary changes in the Pentium processor were:

1. 64-bit data path
2. Instruction cache
3. Data cache
4. Parallel integer execution units
5. Enhanced floating-point unit

Pentium Architecture

It has a 64-bit data bus and a 32-bit address bus. There are two separate 8 KB caches, one for code and one for data.

Each cache has a separate address translation TLB that translates linear addresses to physical addresses. There are 256 lines between the code cache and the prefetch buffer, permitting 32 bytes (256/8) of instructions to be prefetched at a time. Four prefetch buffers within the processor work as two independent pairs. When instructions are prefetched from the cache, they are placed into one set of prefetch buffers; the other set is used when a branch operation is predicted.

Pentium is a two-issue superscalar processor in which two instructions are fetched and decoded simultaneously. The decode unit therefore contains two parallel decoders, which decode and issue up to two sequential instructions into the execution pipelines. The control ROM contains the microcode that controls the sequence of operations performed by the processor; it has direct control over both pipelines. The control unit handles exceptions, breakpoints and interrupts, and controls the integer pipelines and the floating-point sequences. There are two parallel integer instruction pipelines: the u-pipeline and the v-pipeline. The u-pipeline has a barrel shifter. There is also a separate FPU pipeline with individual floating-point add, multiply and divide units. The data cache is dual-ported, accessible by both the u- and v-pipelines simultaneously. A Branch Target Buffer (BTB) supplies jump target prefetch addresses to the code cache. The address generators are equivalent to the segmentation unit. The paging unit, enabled by setting the PG bit in CR0, supports the paging mechanism and handles two linear addresses at the same time to serve both pipelines. A TLB is associated with each cache.

Integer Pipeline

Pentium has a 5-stage integer pipeline that branches into the two paths u and v in the last three stages. The stages are as follows:

PF (Prefetch): The CPU prefetches code from the code cache.

D1 (Instruction Decode): The CPU decodes the instruction to generate a control word. A single control word causes direct execution of an instruction; complex instructions require microcoded control sequencing.

D2 (Address Generate): The CPU decodes the control word and generates addresses for data references.

EX (Execute): The instruction is executed in the ALU. If needed, the barrel shifter is used and the data cache is accessed.

WB (Writeback): The CPU stores the result and updates the flags.
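As a reading aid, the flow of instructions through these five stages can be sketched in Python. This is an idealized model with no stalls or branches; everything except the stage names is illustrative:

```python
# Idealized flow through the Pentium's 5-stage integer pipeline:
# each instruction enters PF one cycle after its predecessor and
# advances one stage per clock. Illustrative only, not cycle-accurate.
STAGES = ["PF", "D1", "D2", "EX", "WB"]

def pipeline_diagram(instructions):
    """Map each instruction to the clock cycle in which it occupies
    each stage, assuming no stalls or branches."""
    diagram = {}
    for i, instr in enumerate(instructions):
        diagram[instr] = {stage: i + s for s, stage in enumerate(STAGES)}
    return diagram

diag = pipeline_diagram(["mov", "add", "inc"])
assert diag["mov"]["WB"] == 4   # first result written back in cycle 4
assert diag["add"]["WB"] == 5   # one further result per cycle thereafter
```

Once the pipeline is full, one instruction per pipeline completes each clock, which is what makes the dual u/v arrangement below worth up to two instructions per cycle.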

Superscalar Operation of Pentium

To understand the superscalar operation of the u- and v-pipelines, we have to distinguish between simple and complex instructions. Simple instructions are entirely hardwired, do not require any microcode control and, in general, execute in one clock cycle.

The exceptions are the ALU mem,reg and ALU reg,mem instructions, which are 3- and 2-clock operations respectively; sequencing hardware is used to allow them to function as simple instructions. The following integer instructions are considered simple and may be paired:

1. mov reg, reg/mem/imm
2. mov mem, reg/imm
3. alu reg, reg/mem/imm
4. alu mem, reg/imm
5. inc reg/mem
6. dec reg/mem
7. push reg/mem
8. pop reg
9. lea reg, mem
10. jmp/call/jcc near
11. nop
12. test reg, reg/mem
13. test acc, imm

Integer Instruction Pairing Rules

In order to issue two instructions simultaneously, they must satisfy the following conditions:

1. Both instructions in the pair must be simple.
2. There must be no read-after-write or write-after-write register dependencies between them.
3. Neither instruction may contain both a displacement and an immediate.

Instructions with prefixes can only occur in the u-pipe (except for Jcc instructions).
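As a sketch, the pairing conditions can be expressed as a predicate over two decoded instructions. The tuple encoding (opcode, destination, sources) is a hypothetical simplification made for illustration:

```python
# Pairing check for two decoded instructions, following the rules in
# the text. The (opcode, dest, sources) tuple encoding is illustrative.
SIMPLE = {"mov", "alu", "inc", "dec", "push", "pop", "lea",
          "jmp", "call", "jcc", "nop", "test"}
JUMPS = {"jmp", "call", "jcc"}

def issue(i1, i2):
    """Return ("u", "v") if the pair can dual-issue, else ("u", None)."""
    op1, dst1, _ = i1
    op2, dst2, src2 = i2
    if (op1 in SIMPLE and op2 in SIMPLE
            and op1 not in JUMPS          # I1 must not be a jump
            and dst1 not in src2          # no read-after-write dependency
            and dst1 != dst2):            # no write-after-write dependency
        return ("u", "v")
    return ("u", None)

# two independent moves pair; a mov followed by an ALU op reading its
# destination register does not
assert issue(("mov", "ax", ()), ("mov", "bx", ())) == ("u", "v")
assert issue(("mov", "ax", ()), ("alu", "bx", ("ax",))) == ("u", None)
```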

Instruction Issue Algorithm

Decode the two consecutive instructions I1 and I2. If all of the following are true:

1. I1 and I2 are simple instructions
2. I1 is not a jump instruction
3. The destination of I1 is not a source of I2
4. The destination of I1 is not a destination of I2

then issue I1 to the u-pipeline and I2 to the v-pipeline; otherwise issue only I1, to the u-pipeline.

Floating Point Unit of Pentium

The floating-point unit (FPU) of the Pentium processor is heavily pipelined. The FPU is designed to be able to accept one floating-point operation every clock. It can receive up to two floating-point instructions every clock, one of which must be an exchange instruction (FXCH). The 8 FP pipeline stages are summarized below:

1. PF: Prefetch
2. D1: Instruction decode
3. D2: Address generation
4. EX: Memory and register read; this stage performs register reads or memory reads as required
5. X1: Floating-point execute stage one; conversion of external memory format to internal FP data format
6. X2: Floating-point execute stage two
7. WF: Perform rounding and write the floating-point result to the register file
8. ER: Error reporting / update of the status word

The rules for how floating-point (FP) instructions are issued on the Pentium processor are:

1. FP instructions do not get paired with integer instructions.

2. When a pair of FP instructions is issued to the FPU, only the FXCH instruction can be the second instruction of the pair. The first instruction of the pair must be one of the set F = {FLD, FADD, FSUB, FMUL, FDIV, FCOM, FUCOM, FTST, FABS, FCHS}.
3. FP instructions other than FXCH and the instructions belonging to set F always issue singly to the FPU.
4. FP instructions that are not directly followed by an FP exchange instruction are issued singly to the FPU.

Branch Prediction Logic

The Pentium processor uses a Branch Target Buffer (BTB) to predict the outcome of branch instructions, which minimizes pipeline stalls due to prefetch delays. When a branch is correctly predicted, no performance penalty is incurred. If the prediction is not correct, it causes a 3-cycle penalty in the u-pipeline and a 4-cycle penalty in the v-pipeline; when a call or conditional jump is mispredicted, a 3-clock penalty is incurred.

The BTB is a cache with 256 entries. The directory entry for each line contains the following information:

1. A valid bit that indicates whether or not the entry is in use
2. History bits that track how often the branch has been taken
3. The source memory address that the branch instruction was fetched from

The BTB sits off to the side of the D1 stages of the two pipelines and monitors them for branch instructions. The first time a branch instruction enters a pipeline, the BTB uses its source address to perform a lookup in its cache, which results in a BTB miss. When the instruction reaches the execution stage, the branch will be either taken or not taken; if taken, the next instruction should be fetched from the branch target address. When a branch is taken for the first time, the execution unit provides feedback to the branch prediction logic and the branch target address is recorded in the BTB. A directory entry is made containing the source memory address and the history bits. The history bits indicate one of four possible states:

1. Strongly Taken
2. Weakly Taken
3. Weakly Not Taken
4. Strongly Not Taken

Strongly Taken: The history bits are initialized to this state when the entry is first made. If a branch marked strongly taken is taken again, it remains strongly taken. When a branch marked strongly taken is not taken the next time, it is downgraded to weakly taken.

Weakly Taken: If a branch marked weakly taken is taken again, it is upgraded to strongly taken. When a branch marked weakly taken is not taken the next time, it is downgraded to weakly not taken.

Weakly Not Taken: If a branch marked weakly not taken is taken again, it is upgraded to weakly taken. When a branch marked weakly not taken is not taken the next time, it is downgraded to strongly not taken.

Strongly Not Taken: If a branch marked strongly not taken is taken again, it is upgraded to weakly not taken. When a branch marked strongly not taken is not taken the next time, it remains in the strongly not taken state.
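Taken together, the four states form a 2-bit saturating counter, with new entries starting in the strongly taken state. A minimal Python sketch of the scheme:

```python
# The BTB history bits modeled as a 2-bit saturating counter. State
# names and transitions follow the four states described above.
ORDER = ["strongly not taken", "weakly not taken",
         "weakly taken", "strongly taken"]

def update(state, taken):
    """Return the new history state after the branch resolves."""
    i = ORDER.index(state)
    if taken:
        return ORDER[min(i + 1, 3)]   # saturate at strongly taken
    return ORDER[max(i - 1, 0)]       # saturate at strongly not taken

def predict_taken(state):
    return state in ("weakly taken", "strongly taken")

state = "strongly taken"              # initial state of a new entry
state = update(state, False)          # not taken -> weakly taken
state = update(state, False)          # not taken again -> weakly not taken
assert state == "weakly not taken" and not predict_taken(state)
```

The saturation means a single anomalous outcome in a long run of taken branches only moves the entry to a weak state, so the next prediction is still correct.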

During the D1 stage of decode, if the branch is predicted not taken, no action is taken at this point. If it is predicted taken, the BTB supplies the branch target address back to the prefetcher and indicates that a positive prediction is being made. In response, the prefetcher switches to the opposite prefetch queue and immediately begins to prefetch from the branch target address. During the execution stage, the branch will either be taken or not. The result of the branch is fed back to the BTB, and the history bits are upgraded or downgraded accordingly.

Cache Organization of Pentium

The Pentium employs two separate internal cache memories: one for instructions and the other for data.

Cache Background

A cache is a special type of high-speed RAM used to help speed up access to memory and to reduce traffic on the processor's busses. An on-chip cache is used to feed instructions and data to the CPU's pipelines, while an external cache is used to speed up main memory access.

Two characteristics of a running program pave the way for performance improvement when a cache is used:

Temporal locality: When we access a memory location, there is a good chance that we will access it again.

Spatial locality: When we access one location, there is a good chance that we will access the next location.

Consider the following loop of instructions:

    MOV  CX,1000
    SUB  AX,AX
NX: ADD  AX,[SI]
    MOV  [SI],AX
    INC  SI
    LOOP NX

The loop will be executed 1000 times. If the cache is initially empty, each instruction fetch on the first pass generates a cache miss and the instruction is read from main memory. The next 999 passes generate hits for each instruction, so the speed is improved. When a miss occurs, the cache reads a copy of a group of locations; this group is called a line of data.
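The resulting hit rate is easy to estimate: with a 4-instruction loop body, only the first pass misses. A quick check, with the counts taken from the loop above:

```python
# Cold misses on the first pass, hits on the remaining 999 passes,
# for the 4 instructions in the loop body (ADD, MOV, INC, LOOP).
passes = 1000
body_instructions = 4
fetches = passes * body_instructions
misses = body_instructions            # first pass only
hit_rate = (fetches - misses) / fetches
assert abs(hit_rate - 0.999) < 1e-12  # 99.9 % of fetches hit the cache
```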

During data accesses (ADD AX,[SI] and MOV [SI],AX), a miss causes a line of data to be read, which makes subsequent data accesses faster. For data writes, the behaviour depends on the policy used by the particular system. There are two policies:

1. Writeback: Write results only to the cache. This results in fast writes, but main memory data may be out of date.
2. Writethrough: Write results to both the cache and main memory. This maintains valid data in main memory but slows down execution.

When the cache is full, a line must be replaced. One algorithm used to choose the victim line is LRU (Least Recently Used). One or more bits are added to each cache entry to support LRU; these bits are updated during hits and examined when a victim must be chosen.
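The two policies can be contrasted with a small sketch; the dictionaries standing in for the cache and main memory are illustrative only:

```python
# Writethrough: memory always stays valid, at the cost of a memory
# write on every store. Writeback: only the cache is updated, and the
# line is marked dirty so memory can be updated on later eviction.
def write_through(cache, memory, addr, value):
    cache[addr] = value
    memory[addr] = value        # memory is always up to date (slower)

def write_back(cache, dirty, memory, addr, value):
    cache[addr] = value
    dirty.add(addr)             # memory updated only when line is evicted

cache, memory, dirty = {}, {0x10: 1}, set()
write_through(cache, memory, 0x10, 2)
assert memory[0x10] == 2                     # memory updated immediately
write_back(cache, dirty, memory, 0x10, 3)
assert memory[0x10] == 2 and 0x10 in dirty   # memory now out of date
```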

Cache Organization

This deals with how a cache with numerous entries can search them so quickly and report a hit if a match is found. A cache may be organized in different hardware configurations; the three main designs are directly mapped, fully associative and set associative. The cache uses a portion of the incoming physical address to select an entry. A tag stored in the entry is compared with the remaining address bits, and a match represents a hit.
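The address-splitting step can be sketched as follows. The 32-byte line and 128-line geometry are borrowed from the Pentium's caches for illustration; a direct-mapped cache of any geometry splits the address the same way:

```python
# How a direct-mapped cache splits a physical address into tag, line
# index and byte offset. Geometry values here are illustrative.
LINE_BYTES = 32
NUM_LINES = 128

def split_address(addr):
    offset = addr % LINE_BYTES                # byte within the line
    index = (addr // LINE_BYTES) % NUM_LINES  # selects the cache entry
    tag = addr // (LINE_BYTES * NUM_LINES)    # compared on lookup
    return tag, index, offset

tag, index, offset = split_address(0x12345)
# the three fields reassemble into the original address
assert (tag * NUM_LINES + index) * LINE_BYTES + offset == 0x12345
```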

Direct Mapped Cache

Each memory address maps to exactly one cache entry, selected by the index portion of the address; only that entry's tag needs to be compared.

Fully Associative Cache

Any entry may hold data from any address; the incoming address is compared against all tags simultaneously.

Set Associative Cache

It combines both concepts. The entries are divided into sets containing 2, 4, 8 or more entries each; two entries per set is called two-way set associative. Each entry has its own tag. A set is selected using its index, and the tags within the set are compared with the remaining address bits.

Cache Organization of Pentium

The data and instruction caches are organized as two-way set associative caches with 128 sets, giving 256 entries per cache. There are 32 bytes in a line, resulting in 8 KB of storage per cache. An LRU algorithm is used to select victims when the cache is full.
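These numbers are easy to verify:

```python
# Checking the data/instruction cache geometry quoted above.
ways, sets, line_bytes = 2, 128, 32
entries = ways * sets
capacity = entries * line_bytes
assert entries == 256          # 256 entries per cache
assert capacity == 8 * 1024    # 8 KB of storage per cache
```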

Internal Structure of Cache

The tags in the data cache are triple-ported (they can be accessed from three different places at the same time). Two of the ports are for the u- and v-pipelines; the third port is for a special operation called bus snooping, which is used to maintain consistent data in a multiprocessor system. Each entry in the data cache can be configured for writethrough or writeback. The instruction cache is write-protected to prevent self-modifying code from changing the executing program. Tags in the instruction cache are also triple-ported, with two ports for split-line access (the upper half and lower half of each line are read simultaneously) and the third port for bus snooping. Parity bits are used in each cache to maintain data integrity: each tag has its own parity bit, and each byte has a parity bit.

Translation Lookaside Buffer

A TLB translates a linear address to a physical address (the actual main memory address). TLBs are caches themselves. The data cache has two TLBs. The first is 4-way set associative with 64 entries and handles 4 KB pages: the lower 12 bits of the address pass through unchanged, while the upper 20 bits of the linear address are checked against four tags and, on a hit, translated into the upper 20 bits of the physical address.

The second TLB is 4-way set associative with 8 entries and handles 4 MB pages. Both TLBs are parity-protected and dual-ported. The instruction cache uses a single 4-way set associative TLB with 32 entries; both 4 KB and 4 MB pages are supported, and parity bits maintain data integrity. Entries in all TLBs use a 3-bit LRU counter.
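The 20/12-bit split for 4 KB pages can be sketched as a lookup. The dictionary standing in for the TLB's tag and data arrays is an illustrative simplification:

```python
# 4 KB-page translation: the upper 20 bits (virtual page number) are
# replaced by the physical frame number on a hit; the lower 12 bits
# (page offset) pass through unchanged.
PAGE_BITS = 12

def translate(linear, tlb):
    """tlb maps virtual page number -> physical frame number.
    Return the physical address, or None on a TLB miss."""
    vpn = linear >> PAGE_BITS
    offset = linear & ((1 << PAGE_BITS) - 1)
    if vpn in tlb:
        return (tlb[vpn] << PAGE_BITS) | offset
    return None   # miss: the page tables must be walked, then refilled

tlb = {0x12345: 0x0ABCD}
assert translate(0x12345678, tlb) == 0x0ABCD678   # offset 0x678 kept
assert translate(0x99999000, tlb) is None         # not in the TLB
```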

Cache Coherency in Multiprocessor System

When multiple processors are used in a single system, there needs to be a mechanism by which all processors agree on the contents of shared cache information. For example, two or more processors may use data from the same memory location X. If each processor can change the value of X, which value of X should be considered the correct one?

A multiprocessor system with incoherent cache data

Intel's mechanism for maintaining cache coherency in its data cache is called the MESI (Modified/Exclusive/Shared/Invalid) protocol. This protocol uses two bits stored with each line of data to keep track of the state of the cache line. The four states are defined as follows:

Modified: The current line has been modified and is only available in a single cache.

Exclusive: The current line has not been modified and is only available in a single cache. Writing to this line changes its state to modified.

Shared: Copies of the current line may exist in more than one cache. A write to this line causes a writethrough to main memory and may invalidate the copies in the other caches.

Invalid: The current line is empty. A read from this line will generate a miss; a write will cause a writethrough to main memory.

Only the shared and invalid states are used in the code cache. The MESI protocol requires the Pentium to monitor all accesses to main memory in a multiprocessor system; this is called bus snooping. Consider the example above: if processor 3 writes its local copy of X (30) back to memory, the memory write cycle is detected by the other three processors. Each processor then runs an internal inquire cycle to determine whether its data cache contains the address of X. Processors 1 and 2 then update their caches based on their individual MESI states. Inquire cycles examine the code cache as well, since the code cache also supports bus snooping. The Pentium's address lines are used as inputs during an inquire cycle to accomplish bus snooping.
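The inquire-cycle handling in this example can be sketched as follows; the dictionary mapping line addresses to MESI states is an illustrative simplification of the cache directory:

```python
# When another processor's memory write is snooped on the bus, any
# local copy of that line is invalidated. This is a simplified model
# of the inquire cycle, not the full MESI state machine.
def snoop_write(cache, line_addr):
    """cache maps line address -> MESI state ('M', 'E', 'S' or 'I')."""
    if cache.get(line_addr, "I") != "I":
        cache[line_addr] = "I"   # the local copy is now stale

p1 = {0x1000: "S"}               # processors 1 and 2 share line X
p2 = {0x1000: "S"}
snoop_write(p1, 0x1000)          # processor 3 writes X back to memory
snoop_write(p2, 0x1000)
assert p1[0x1000] == "I" and p2[0x1000] == "I"
```

A subsequent read of X by processor 1 or 2 then misses and fetches the fresh value from memory.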

Cache Instructions

Three instructions are provided to give the programmer some control over cache operation:

1. INVD (Invalidate Cache)
2. INVLPG (Invalidate TLB Entry)
3. WBINVD (Write Back and Invalidate Cache)

INVD effectively erases all information in the data cache; any values not previously written back are lost when INVD executes. This problem can be avoided by using WBINVD, which first writes back any updated cache entries and then invalidates them. INVLPG invalidates the TLB entry associated with a supplied memory operand.

Apart from these instructions, cache operation is handled automatically by the Pentium; no program code is needed to make the cache work.

Bus Operations

Some of the operations carried out over the Pentium's address and data busses are:

1. Data transfers (single or burst cycles)
2. Interrupt acknowledge cycles
3. Inquire cycles
4. I/O operations

Decoding a Bus Cycle

There are six possible states the Pentium bus may be in: TI, T1, T2, T12, T2P and TD. TI is the idle state and indicates that no bus cycle is currently running; the bus begins in this state after reset. During the first state, T1, a valid address is output on the address lines. During the second state, T2, data is read or written. The T12 state indicates that the processor is starting a second bus cycle at the same time that data is being transferred for the first. State T2P continues the bus cycle started in T12. TD is used to insert a dead state between two consecutive cycles, to give the system bus time to change states.

The bus state controller follows a predefined set of transitions in the form of a state diagram.

The transitions between states are defined as follows:

(0) No request pending
(1) New bus cycle started and ADS# is asserted
(2) Second clock cycle of the current bus cycle
(3) Stay in T2 until BRDY# is active or a new bus cycle is requested
(4) Go back to T1 if a new request is pending
(5) Bus cycle complete: go back to the idle state
(6) Begin a second bus cycle
(7) Current cycle is finished and no dead clock is needed
(8) A dead clock is needed after the current cycle is finished
(9) Go to T2P to transfer data
(10) Wait in T2P until data is transferred
(11) Current cycle is finished and no dead clock is needed
(12) A dead clock is needed after the current cycle is finished
(13) Begin a pipelined bus cycle if NA is active
(14) No new bus cycle is pending

