ACA Notes


Architecture vs. Organization

- Architecture: also known as Instruction Set Architecture (ISA). The programmer-visible part of a processor: instruction set, registers, addressing modes, etc.
- Organization: the high-level design: how many caches? how many arithmetic and logic units? what type of pipelining, control design, etc. Sometimes known as micro-architecture.

Computer Architecture

- The structure of a computer that a machine language programmer must understand in order to write a correct program for that machine.
- A family of computers of the same architecture should be able to run the same assembly language program; architecture thus leads to the notion of binary compatibility.

Amdahl's Law and Speedup

Speedup tells us how much faster a machine will run due to an enhancement. To use Amdahl's law, two quantities are needed:
1. The fraction of the computation time in the original machine that can use the enhancement: if a program executes in 30 seconds and 15 seconds of the execution can use the enhancement, the fraction is 15/30 = 0.5.
2. The improvement gained by the enhancement: if the enhanced task takes 3.5 seconds where the original took 7 seconds, the speedup is 2.

Amdahl's Law: Example

Floating point instructions are improved to run 2x faster, but only 10% of actual instructions are FP. What are ExTime_new and Speedup_overall?

    ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old

    Speedup_overall = ExTime_old / ExTime_new = 1 / 0.95 = 1.053
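As a sanity check, here is a minimal C sketch of Amdahl's law (the function and names are illustrative, not from the notes):

    #include <stdio.h>

    /* Amdahl's law: overall speedup when 'fraction' of the original
       execution time is accelerated by 'factor'. */
    double amdahl(double fraction, double factor) {
        return 1.0 / ((1.0 - fraction) + fraction / factor);
    }

    int main(void) {
        /* FP instructions: 10% of execution time, made 2x faster. */
        printf("overall speedup = %.3f\n", amdahl(0.10, 2.0)); /* 1.053 */
        return 0;
    }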

Instruction Set Architecture (ISA)
The programmer-visible part of a processor:
- Instruction set (what operations can be performed?)
- Instruction format (how are instructions specified?)
- Registers (where are data located?)
- Addressing modes (how is data accessed?)
- Exceptional conditions (what happens when something goes wrong?)
The ISA is important not only from the programmer's perspective, but from the processor designer's and implementer's perspectives as well.

Different Types of ISAs
Determined by the means used for storing data in the CPU. The major choices are a stack, an accumulator, or a set of registers:
- Stack architecture: operands are implicitly on top of the stack.
- Accumulator architecture: one operand is in the accumulator (a register) and the others are elsewhere; essentially a 1-register machine. Found in older machines.
- General-purpose registers: operands are in registers or specific memory locations.

Comparison of Architectures

Consider the operation C = A + B on each style of machine:

    Stack       Accumulator    Register-Memory    Register-Register
    Push A      Load A         Load R1, A         Load R1, A
    Push B      Add B          Add R1, B          Load R2, B
    Add         Store C        Store C, R1        Add R3, R1, R2
    Pop C                                         Store C, R3

How to Improve Processor Performance?
- Operate on multiple data at the same time: data parallelism.
- Operate on multiple operations at the same time: operation parallelism.

RISC and CISC
- RISC: Reduced Instruction Set Computer. CISC: Complex Instruction Set Computer.
- Genesis of the CISC architecture: implementing commonly used instructions in hardware can lead to significant performance benefits. For example, use of an FP processor can lead to performance improvements.
- Genesis of the RISC architecture: rarely used instructions can be eliminated to save chip space, so that an on-chip cache and a large number of registers can be provided instead.

Features of a CISC Processor
- Rich instruction set: some instructions simple, some very complex.
- Complex addressing modes: orthogonal addressing (every possible addressing mode for every instruction).
- Many instructions take multiple cycles: large variation in CPI.
- Instructions are of variable sizes.
- Small number of registers.
- Microcoded control.
- No (or inefficient) pipelining.

CISC Example
One instruction could do the work of several instructions. For example, a single instruction could load two numbers to be added, add them, and then store the result back to memory directly. Many versions of the same instruction were also supported: different versions did almost the same thing, with minor changes. For example, one version would read two numbers from memory and store the result in a register; another version would read one number from memory and the other from a register, and store the result to memory.

Features of a RISC Processor
- Small number of instructions.
- Small number of addressing modes.
- Large number of registers (>32).
- Instructions execute in one or two clock cycles.
- Uniform-length instructions and a fixed instruction format.
- Register-register architecture: separate memory instructions (load/store).
- Separate instruction and data caches.
- Hardwired control.
- Pipelining. (Why are CISC processors not pipelined?)

Synchronous Pipeline
- Transfers between stages are simultaneous.
- One task or operation enters the pipeline per cycle.

[Figure: a k-stage synchronous pipeline: stages S1 to Sk separated by latches (L), input at S1, output after Sk, all latches driven by a common clock.]

Asynchronous Pipeline
- Transfers are performed when individual stages are ready.
- A handshaking protocol operates between adjacent stages (sketched below).
- Different amounts of delay may be experienced at different stages.
- Can display a variable throughput rate.

[Figure: an asynchronous pipeline: stages S1 to Sk connected by Ready/Ack handshake signals from input to output.]
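A cycle-by-cycle C sketch of a four-phase Ready/Ack handshake between two stages (illustrative only; real designs use request/acknowledge wires between stage latches):

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { bool ready, ack; int data; } Channel;

    /* Producer stage: offers data with ready, then waits for ack. */
    void producer_step(Channel *ch, int *next) {
        if (!ch->ready && !ch->ack) {      /* channel idle: offer data */
            ch->data = (*next)++;
            ch->ready = true;
        } else if (ch->ready && ch->ack) { /* consumer took it: drop ready */
            ch->ready = false;
        }
    }

    /* Consumer stage: sees ready, latches the data, pulses ack. */
    void consumer_step(Channel *ch) {
        if (ch->ready && !ch->ack) {
            printf("stage 2 received %d\n", ch->data);
            ch->ack = true;
        } else if (!ch->ready && ch->ack) {
            ch->ack = false;               /* handshake complete */
        }
    }

    int main(void) {
        Channel ch = { false, false, 0 };
        int next = 1;
        for (int t = 0; t < 12; t++) {     /* simulate a few time steps */
            producer_step(&ch, &next);
            consumer_step(&ch);
        }
        return 0;
    }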

Data Dependencies: Summary

Data dependencies in straight-line code:
- RAW (Read After Write): flow dependency, covering load-use and define-use dependencies. A true dependency: it cannot be overcome.
- WAR (Write After Read): anti dependency.
- WAW (Write After Write): output dependency.
WAR and WAW are false dependencies: they can be eliminated by register renaming, as the sketch below illustrates.
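A tiny C illustration of the three dependency types (variable names are hypothetical):

    void deps(void) {
        int a = 1, b = 2, c = 3, d = 4;
        int x, y, z, w;

        /* RAW (true/flow): the second statement reads x, which the
           first writes; their order cannot be changed. */
        x = a + b;
        y = x - c;

        /* WAR (anti): the second statement overwrites a, which the
           first still reads; a false dependency. */
        z = a * 2;
        a = 7;

        /* WAW (output): both statements write w; their order decides
           the final value, another false dependency. */
        w = a + b;
        w = c + d;

        /* Register renaming removes the false dependencies: if a = 7
           targets a fresh register a2 (and later readers use a2), the
           WAR pair can execute in either order. */
        (void)y; (void)z; (void)w;
    }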

Two Paths to Higher ILP

- Superscalar processors: multiple issue, dynamically scheduled, speculative execution, branch prediction. More hardware functionality and complexity.
- VLIW: let the compiler take the complexity. Simple hardware, smart compiler.

Superscalar Execution

Scheduling of instructions is determined by a number of factors:
- True data dependency: the result of one operation is an input to the next.
- Resource constraints: two operations require the same resource.
- Branch dependency: scheduling instructions across conditional branch statements cannot be done deterministically a priori.
A superscalar processor of degree m can issue up to m instructions per cycle; each cycle, an appropriate number of instructions is issued.

Very Long Instruction Word (VLIW) Processors
- The hardware cost and complexity of superscalar schedulers is a major consideration in processor design.
- VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently. These instructions are packed and dispatched together: thus the name very long instruction word.
- This concept is employed in the Intel IA-64 processors.
- The compiler has complete responsibility for selecting a set of instructions that can be executed concurrently.
- VLIW processors have static instruction issue capability; superscalar processors, by comparison, have dynamic issue capability.

The Basic VLIW Approach
- VLIW processors deploy multiple independent functional units.
- Early VLIW processors operated in lock step: there was no hazard detection in hardware at all, so a stall in any functional unit caused the entire pipeline to stall.
- Compare a 4-issue static superscalar processor: during the fetch stage, 1 to 4 instructions are fetched. The group of instructions that can be issued in a single cycle is called an issue packet, or a bundle. If an instruction could cause a structural or data hazard, it is not issued.

VLIW Processors: Some Considerations
- Issue hardware is simpler.
- The compiler has a bigger context from which to select co-scheduled instructions.
- Compilers, however, do not have runtime information such as cache misses, so scheduling is inherently conservative.
- Branch and memory prediction is more difficult.
- Typical VLIW processors are limited to 4-way to 8-way parallelism.

VLIW Summary
- Each instruction is very large: it bundles multiple operations that are independent.
- The compiler detects hazards and determines the schedule.
- There is no (or only partial) hardware hazard detection: no dependence-check logic for instructions issued in the same cycle.
- Trades instruction space for simple decoding: the long instruction word has room for many operations, but slots must be filled with NOPs if enough independent operations cannot be found.

VLIW vs. Superscalar
- Parallelism: the VLIW compiler finds it; superscalar hardware finds it.
- Hardware: VLIW is simpler; superscalar is more complex.
- Performance: VLIW exploits less parallelism for a typical program; superscalar performs better.

Superscalar Processors
Commercial desktop processors now do four or more issues per clock; even in the embedded processor market, dual-issue superscalar pipelines are becoming common.

Vector Processing
- A typical vector instruction might add two 64-element FP vectors. Vector processing was commercialized long before ILP machines.
- What is a vector processor? A vector processor supports high-level operations (add, subtract, multiply, etc.) on vectors: SIMD processing.

Why Vector Processors?
- One vector instruction is equivalent to executing an entire loop: this reduces instruction fetch and decode overheads and bandwidth.
- Each element operation is guaranteed to be independent of the others in the same vector instruction: no data hazard check is needed within an instruction, so execution can use an array of functional units or a deep pipeline.
- Hardware needs to check for data hazards only between two vector instructions: once per pair of vector instructions.

- More instructions are handled per data hazard check.
- Memory access is for an entire vector, not a single word: reduced latency.
- Multiple vector instructions can be in progress: further parallelism.

Basic Vector Architectures
Two types:
- Vector-register: all operations except load and store are register based.
- Memory-memory: all operations are memory to memory.
A vector register is of fixed length and holds a single vector.

Issue 1: Memory Bandwidth
- Problem: the memory system must be able to produce and accept large amounts of data. How do we achieve this when access time is poor?
- Solution: create multiple memory banks. Banking is also useful for fragmented accesses, and it supports multiple loads per clock cycle.

Issue 2: Vector Length
- Problem: how do we support operations where the length is unknown or not the same as the vector register length?
- Solution: provide a vector-length register. This solves the problem only if the real length is less than the maximum vector length; otherwise, use a technique called strip mining.

Vector Length Register (VLR)
- A vector register can hold some maximum number of elements for each data width: the maximum vector length, or MVL.
- What to do when the application vector length is not exactly MVL? The vector-length register (VLR) controls the length of any vector operation, including a vector load or store. E.g., vadd with VL = 10 performs:

    for (i = 0; i < 10; i++)
        V1[i] = V2[i] + V3[i];

- VL can be anything from 0 to MVL. But how do you code an application where the vector length is not known until runtime?

Strip Mining
- Handles vector operations for sizes greater than MVL.
- Creates 2 loops: one that handles any number of iterations that is a multiple of MVL, and another that handles the remaining iterations.
- The code becomes vectorizable; careful handling of the VLR is needed.

Example: strip mining (the notes' loop, reformatted):

    low = 1;                              /* assume start element is 1 */
    vL = n % mvL;                         /* find the odd-size piece   */
    for (j = 0; j <= n / mvL; j++) {      /* outer loop                */
        for (i = low; i <= low + vL - 1; i++)  /* inner loop: length vL */
            y[i] = a * x[i] + y[i];
        low = low + vL;                   /* start of next vector      */
        vL = mvL;                         /* reset length to max       */
    }

Present Applications of Vector Processors: Media Processing
- Desktop: 3D graphics (games), speech recognition (voice input), video/audio decoding (MPEG/MP3 playback).
- Servers: video/audio encoding (video servers, IP telephony), digital libraries and media mining (video servers), computer animation, 3D modeling & rendering (movies).
- Embedded: 3D graphics (game consoles), video/audio decoding & encoding (set-top boxes), image processing (digital cameras), signal processing (cellular phones).

Superscalar versus Vector Processing
- A vector processor can efficiently exploit parallelism in regular code: matrix operations, multimedia operations, scientific computations, etc.
- A superscalar processor can exploit a reasonable amount of parallelism in less structured code: typical programs.

Early Intel Microprocessors
- Intel 8080: 64K addressable RAM, 8-bit registers, CP/M operating system, S-100 bus architecture, 8-inch floppy disks!
- Intel 8086/8088: used in IBM PCs; 1 MB addressable RAM; 16-bit registers; 16-bit data bus (8-bit for the 8088); separate floating-point unit (8087).

x86 Microprocessors
Intel 8086 and 80286; the IA-32 processor family; the P6 processor family; the NetBurst family.

x86 Processor History

IA-16 processors:
- 8086: Intel's first 16-bit PC microprocessor.
- 8088: a minor refinement of the 8086.
- 80186: an extension to the 8086.
- 80286: a reasonably successful extension to the 8086.

IA-32 processors:
- 80386: Intel's first 32-bit, protected-mode processor.
- 80486: a much improved 80386, using an instruction pipeline.
- 80586: a much improved 80486, named the Pentium.
- 80686: an improved 80586, named the Pentium Pro.
- 80586+MMX: a refined 80586, faster, and with MMX extensions, named the Pentium MMX.
- 80686+MMX: a refined 80686, faster, and with MMX extensions, named the Pentium II.
- 80686+MMX+SSE: a refined 80686+MMX, with SSE extensions, named the Pentium III.

IA-64 processors: newer "Pentium"-branded processors, Intel's attempt at a 64-bit architecture.

Intel IA-32 Family
- Intel 386 (1985): 4 GB addressable RAM, 32-bit registers, paging (virtual memory).
- Intel 486 (1989): instruction pipelining.
- Pentium (1993): superscalar, 32-bit address bus, 64-bit internal data path.

IA-32
- The 8086 began an evolution that eventually resulted in the IA-32 family of object-code-compatible microprocessors.
- IA-32 is a CISC architecture: variable-length instructions and complex addressing modes.
- It turned out to be the most dominant architecture of its time in terms of sales volume.
- 1985: the Intel 386. 1989: the first pipelined version of the IA-32 family, the Intel 486, was introduced.

386 and Onward to 486
- The 80386 was the first IA-32 implementation: it included several architectural improvements in addition to the wider data path.
- Perhaps the most important feature was the extension of the virtual memory architecture: it includes both the segmentation used in the 80286 and paging, the preferred technique in the Unix world.
- The 486 improved on the 386 throughout: pipelined, and with an on-chip floating-point unit.

Intel 486 5-Stage CISC Pipeline

    Stage                      Function performed
    1. Instruction fetch       Fetch instruction from the 32-byte prefetch queue
    2. Instruction decode-1    Translate instruction into control signals or microcode
                               address; initiate address generation and memory access
    3. Instruction decode-2    Access microcode memory; output microinstruction to
                               the execution unit
    4. Execute                 Execute ALU and memory-accessing operations
    5. Register write-back     Write back results to the register file

i486 Pipeline
- Fetch: load 16 bytes of instruction into the prefetch buffer.
- Decode1: determine instruction length and instruction type.
- Decode2: compute the memory address; generate immediate operands.
- Execute: register read, ALU operation, memory read/write.
- Write-back: update the register file.

A Reflection on the 486 Pipeline
- Two decoding stages: CISC instructions are harder to decode, which is inevitable with microcoded control. Effective address calculation happens in D2.
- Multicycle decoding stages handle the more difficult decodings, stalling incoming instructions.

486 vs. 386 Cycles Per Instruction

    Instruction type    386 cycles    486 cycles
    Load                4             1
    Store               2             1
    ALU                 2             1
    Jump taken          9             3
    Jump not taken      3             1
    Call                9             3

Reasons for the improvement: an on-chip cache, faster loads and stores, and a deeper pipeline.

Pentium Block Diagram
[Figure: Pentium block diagram.]

Pentium Overview
- Architecturally, the Pentium is vastly different from the 486.
- The Pentium is essentially one full 486 execution unit (EU), called the U pipe, plus a second, stripped-down unit called the V pipe.
- The two pipes are capable of executing instructions simultaneously, with separate write buffers and even simultaneous access to the data cache. This is how the Pentium is superscalar of degree two.
- How can the Pentium supply data and instructions at a much faster rate, at least twice as fast as the 486? The 486 has a single 8K L1 data/instruction cache; the Pentium has two separate 8K L1 caches, one for code and the other for data. The Pentium also expands the 486's 32-byte prefetch queue to 128 bytes.

Pentium Pipeline

[Figure: the Pentium pipeline: shared Fetch & Align and Instruction Decode / Generate Control Word stages, then identical U-pipe and V-pipe stages: Decode Control Word, Generate Memory Address, Access Data Cache or Calculate ALU Result, Write Register Result.]

Intel P6 Family
- Pentium Pro (1995).
- Pentium II: MMX (multimedia) instruction set.
- Pentium III: SIMD (streaming extensions) instructions.
- Pentium 4 and Xeon: Intel NetBurst micro-architecture, tuned for multimedia.

The P6 Microarchitecture
- Forms the basis of the Pentium Pro, Pentium II and Pentium III: besides some specialized instruction set extensions (MMX and SSE), these processors differ in clock rate and cache architecture.
- A dynamically scheduled processor: translates each IA-32 instruction into a series of micro-operations (uops); uops are similar to typical RISC instructions.
- Hardwired control unit.

P6 Microarchitecture

[Figure: P6 pipeline: Instruction Fetch (16 bytes/cycle) feeds Instruction Decode (up to 3 instructions/cycle, producing up to 6 uops); Renaming & Issue sends 3 uops/cycle into 20 reservation stations feeding 5 execution units; a 40-entry Reorder Buffer and the Graduation Unit retire 3 uops/cycle.]

Pentium Pro (1995)
- Supports predicated instructions.
- Instructions are decoded into micro-operations (uops): uops are register-renamed and placed into an out-of-order, speculative pool of pending operations, then executed in dataflow order (when their operands are ready).

Pentium II/III
- The Pentium II/III processors use the P6 microarchitecture: three-way superscalar.
- The pipelined micro-architecture features a 12-stage superpipeline: it trades less work per pipe stage for more stages, achieving a higher clock rate.

Pentium and Pentium II/III Microarchitecture

[Figure: Pentium II/III block diagram: a Bus Interface Unit connects the external bus and L2 cache; the Instruction Fetch Unit (with I-cache) and Branch Target Buffer feed the Instruction Decode Unit with its Microcode Instruction Sequencer and Register Alias Table; the Reservation Station Unit dispatches to the functional units; the Memory Reorder Buffer, Memory Interface Unit and D-cache unit handle memory; the Reorder Buffer & Retirement Register File retire results.]

Pentium 4
- Announced in mid-2000; native IA-32 instructions; NetBurst micro-architecture.
- 20 pipeline stages (integer pipeline); originally clocked at 1.5 GHz; 42 million transistors.
- Very deep, out-of-order, speculative execution engine: up to 126 instructions in flight (3 times the Pentium III processor), and up to 48 loads and 24 stores in the pipeline (2 times the Pentium III processor).

Branch Prediction
- 4K-entry branch target array: 8 times larger than the Pentium III processor's.
- New prediction algorithm (not specified): reduces mispredictions compared to the P6 by about one third.

Second-Level Cache
- Included on the die: 256 KB, unified, 8-way associative.
- 256-bit data bus to the level 2 cache: delivers ~45 GB/s data throughput at 1.4 GHz processor frequency (32 bytes x 1.4 GHz). Bandwidth and performance increase with processor frequency.

A Commercial Superscalar Processor
- PowerPC: eleven pipelined functional units, including 4 IUs and an FPU with a separate floating-point register file. It is capable of executing sixteen instructions simultaneously.

AMD Athlon
- Advanced Micro Devices has carved out a niche in the Intel-architecture market with its line of instruction-set-compatible processors. The latest AMD offering is the Athlon family of processors.

Athlon
- The micro-architecture of the Athlon family is of considerable interest: in many respects it is more powerful than the Pentium core.
- The Athlon uses three integer units, three floating-point units and three address calculation units, for a total of nine execution units: it can issue 9 operations concurrently (three integer, three address, and three floating point).
- A 10-stage integer pipeline and a 15-stage floating-point pipeline are used.
- The floating-point execution units can perform Intel SIMD MMX instructions as well as AMD 3DNow! instructions.
- Like Intel's Pentium family, the AMD Athlons use a "RISC-like" core:

- Intel CISC instructions are decoded by a three-way instruction decoder into fixed-length "MacroOPs", which are fed into the Instruction Control Unit with its 72-entry reorder buffer.
- Branch prediction is performed using a two-way, 2048-entry branch prediction table, a branch target address table, and a return address stack.

ARM
- ARM Ltd. (formerly Advanced RISC Machines) licenses its design to vendors: IBM, Intel, Philips, Samsung, TI, etc.
- A 32-bit processor architecture, widely used in embedded systems: mobile phones, PDAs, calculators, routers, media players, etc.
- Low power consumption is one of the critical design goals.

ARM Architecture
- Load/store architecture.
- 16 32-bit registers.
- Predicated execution of most of the instructions.

MEMORY

Levels of the Memory Hierarchy
From the upper level down, each level is larger, slower, and cheaper per bit; data is staged between levels in progressively larger transfer units:

    Level          Capacity    Access time            Cost                     Transfer unit (managed by)
    CPU registers  100s bytes  < 10 ns                                         instruction operands, 1-8 bytes (compiler)
    Cache          K bytes     10-100 ns              0.1-1 cents/bit          cache lines, 8-128 bytes (cache controller)
    Main memory    M bytes     200-500 ns             1e-5 - 1e-4 cents/bit    pages, 512-4K bytes (operating system)
    Disk           G bytes     10 ms (10,000,000 ns)  1e-6 - 1e-5 cents/bit    files, Mbytes (user)
    Tape           infinite    sec-min                1e-8 cents/bit

Memory Issues
- Latency: the time to move through the longest circuit path (from the start of a request to the response).
- Bandwidth: the number of bits transported at one time.
- Capacity: the size of the memory.
- Energy: the cost of accessing memory (to read and write).

Cache Write Policies
- Write-through: information is written to both the block in the cache and the block in memory.
- Write-back: information is written back to memory only when a block frame is replaced. A dirty bit indicates whether a block was actually written to, saving unnecessary writes to memory when a block is clean.

Trade-offs
- Write-back is faster because writes occur at the speed of the cache, not the memory, and because multiple writes to the same block are written back to memory only once, using less memory bandwidth.
- Write-through is easier to implement.

Write Allocate, No-Write Allocate
What happens on a write miss? (On a read miss, a block always has to be brought in from the lower-level memory.) Two options:
- Write allocate: a block is allocated in the cache.
- No-write allocate: no block is allocated; the data is just written to main memory. With no-write allocate, only blocks that are read from can be in the cache; write-only blocks are never cached.
Typically, write allocate is paired with write-back, and no-write allocate with write-through. Why does this make sense? Write-back gains from allocating, since later writes to the block hit in the cache, while with write-through every write goes to memory anyway, so allocating on a write miss buys little. Both pairings are sketched below.
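The following self-contained C sketch contrasts the two pairings on a toy direct-mapped, word-granular cache (all names and sizes are illustrative, not from the notes):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    enum { MEM_WORDS = 256, CACHE_LINES = 8 };

    static uint32_t memory[MEM_WORDS];

    typedef struct { bool valid, dirty; uint32_t tag, data; } Line;
    static Line cache[CACHE_LINES];

    /* Write-through + no-write allocate: every write updates memory;
       a write miss does not bring the block into the cache. */
    void write_through(uint32_t addr, uint32_t word) {
        Line *l = &cache[addr % CACHE_LINES];
        if (l->valid && l->tag == addr)    /* write hit: keep cache coherent */
            l->data = word;
        memory[addr] = word;               /* always write memory */
    }

    /* Write-back + write allocate: a write miss allocates the line;
       memory is updated only when a dirty line is evicted. */
    void write_back(uint32_t addr, uint32_t word) {
        Line *l = &cache[addr % CACHE_LINES];
        if (!l->valid || l->tag != addr) {         /* write miss */
            if (l->valid && l->dirty)
                memory[l->tag] = l->data;          /* evict dirty victim */
            l->valid = true;
            l->tag = addr;
            l->data = memory[addr];                /* allocate and fill */
        }
        l->data = word;
        l->dirty = true;                   /* defer the memory write */
    }

    int main(void) {
        write_back(7, 42);                 /* memory[7] is still stale */
        printf("after write-back:    mem[7] = %u\n", memory[7]);  /* 0  */
        write_through(7, 42);              /* memory updated at once   */
        printf("after write-through: mem[7] = %u\n", memory[7]);  /* 42 */
        return 0;
    }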

Write-Through Policy

[Figure: write-through example: the processor writes 0x5678 over 0x1234; both the cache copy and the memory copy are updated.]

Write Buffer

[Figure: processor writes go to the cache and into a write buffer that sits between the processor and DRAM.]

- The processor writes data into the cache and the write buffer; the memory controller writes the contents of the buffer to memory.
- The write buffer is a FIFO structure, typically with 4 to 8 entries.
- Desirable: writes occur much less often than DRAM write cycles can retire them.
- Write buffer saturation (writes arriving as fast as DRAM write cycles) is a memory system designer's nightmare. A minimal sketch of the FIFO follows.
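A minimal C sketch of the write-buffer FIFO (illustrative names; in hardware the drain side is the memory controller):

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 8                     /* typically 4 to 8 */

    typedef struct { uint32_t addr, data; } WBEntry;

    typedef struct {
        WBEntry e[WB_ENTRIES];
        int head, tail, count;               /* FIFO: oldest write retires first */
    } WriteBuffer;

    /* Processor side: returns false when the buffer is saturated,
       in which case the processor must stall until an entry drains. */
    bool wb_push(WriteBuffer *wb, uint32_t addr, uint32_t data) {
        if (wb->count == WB_ENTRIES)
            return false;                    /* saturation: stall */
        wb->e[wb->tail] = (WBEntry){ addr, data };
        wb->tail = (wb->tail + 1) % WB_ENTRIES;
        wb->count++;
        return true;
    }

    /* Memory-controller side: drain one entry per DRAM write cycle. */
    bool wb_pop(WriteBuffer *wb, WBEntry *out) {
        if (wb->count == 0)
            return false;                    /* nothing to retire */
        *out = wb->e[wb->head];
        wb->head = (wb->head + 1) % WB_ENTRIES;
        wb->count--;
        return true;
    }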

Writeback Policy

[Figure: write-back example: the processor's writes (0x5678, then 0x9ABC) update only the cache; memory keeps the stale value until the dirty block is written back on replacement.]

Unified vs. Split Caches
- A load or store instruction requires two memory accesses: one for the instruction itself and one for the data. A unified cache therefore causes a structural hazard!
- Modern processors use separate data and instruction L1 caches (as opposed to unified or mixed caches): the CPU sends the instruction address and the data address simultaneously to the two ports.
- Both caches can be configured differently: size, associativity, etc.

- Separate instruction and data caches avoid the structural hazard, and each cache can be tailored to its specific need.

[Figure: a split cache (separate I-Cache-1 and D-Cache-1 backed by a Unified Cache-2) versus a fully unified cache hierarchy.]

Example 4
Assume a 16KB instruction cache and a 16KB data cache, with instruction miss rate = 0.64% and data miss rate = 6.47%, versus a 32KB unified cache with an aggregate miss rate of 1.99%. Assume 33% of instructions are data operations, so 75% of accesses are instruction fetches (1.0/1.33); hit time = 1 cycle and miss penalty = 50 cycles. A data hit incurs 1 additional stall cycle in the unified cache (why? the single port is busy with an instruction fetch: a structural hazard). Which is better (ignoring the L2 cache)?

    AMAT_split   = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50)     = 2.05
    AMAT_unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24

The split cache wins.

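These numbers can be checked with a few lines of C (constants taken from the example above):

    #include <stdio.h>

    /* AMAT = hit time + miss rate x miss penalty, weighted by access mix. */
    int main(void) {
        const double f_instr = 0.75, f_data = 0.25;   /* access mix   */
        const double penalty = 50.0;                  /* miss penalty */

        double amat_split   = f_instr * (1 + 0.0064 * penalty)
                            + f_data  * (1 + 0.0647 * penalty);
        double amat_unified = f_instr * (1 + 0.0199 * penalty)
                            + f_data  * (1 + 1 + 0.0199 * penalty); /* +1 stall */

        /* Prints 2.049 and 2.245: the 2.05 and 2.24 above, rounded. */
        printf("split = %.3f, unified = %.3f\n", amat_split, amat_unified);
        return 0;
    }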

Example 5
What is the impact of two different cache organizations on CPU performance? Clock cycle time is 1 ns; 50% of instructions are loads/stores; the base CPI with a perfect cache is 2.0. Both caches are 64KB with 64-byte blocks; one is direct mapped (DM), the other is 2-way set associative (SA). The cache miss penalty is 75 ns for both. Miss rate DM = 1.4%; miss rate SA = 1%. The CPU cycle time must be stretched 25% to accommodate the multiplexor for the set-associative cache.

Solution:

    AMAT_DM = 1 + (0.014 x 75) = 2.05 ns
    AMAT_SA = 1 x 1.25 + (0.01 x 75) = 2.00 ns

    CPU time    = IC x (CPI x cycle time + (misses/instr) x miss penalty)
    CPU time DM = IC x (2 x 1.0  + (1.5 x 0.014 x 75)) = 3.58 x IC
    CPU time SA = IC x (2 x 1.25 + (1.5 x 0.01  x 75)) = 3.63 x IC

The SA cache has the lower AMAT, but the DM cache yields the lower CPU time, because the stretched clock cycle penalizes every instruction. (Memory accesses per instruction = 1.5: one instruction fetch plus 0.5 data accesses.)

Multilevel Cache
- The speed (hit time) of the L1 cache affects the clock rate of the CPU; the speed of the L2 cache only affects the miss penalty of L1.
- Inclusion policy: many designers keep the L1 and L2 block sizes the same; otherwise, on an L2 miss, several L1 blocks may have to be invalidated.
- Multilevel exclusion: L1 data is never found in L2. The AMD Athlon follows the exclusion policy.
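With two levels, the L1 miss penalty is itself the average access time of L2. A small C sketch of the two-level AMAT formula (the numbers are illustrative, not from the notes):

    #include <stdio.h>

    /* AMAT = HitTime_L1 + MissRate_L1 x (HitTime_L2 + MissRate_L2 x Penalty_L2) */
    int main(void) {
        double hit1 = 1.0,  miss1 = 0.05;   /* L1: 1 cycle, 5% miss rate     */
        double hit2 = 10.0, miss2 = 0.20;   /* L2: 10 cycles, 20% local miss */
        double penalty2 = 100.0;            /* main memory access, cycles    */

        double amat = hit1 + miss1 * (hit2 + miss2 * penalty2);
        printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.05 x (10 + 20) = 2.5 */
        return 0;
    }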
