Modern Computer Architecture (Processor Design)
Prof. Dan Connors dconnors@colostate.edu
Computer Architecture
Historic definition: Computer Architecture = Instruction Set Architecture + Computer Organization
Famous architects: Wright, Fuller, Herzog
Famous computer architects: Smith, Patt, Hwu, Hennessy, Patterson
Instruction Set Architecture

Important acronym: ISA
The low-level software interface to the machine

Assembly language of the machine
Must translate any programming language into this language
Examples: IA-32 (Intel instruction set), MIPS, SPARC, Alpha, PA-RISC, PowerPC
Visible to programmer
Differences between ISAs

Much more is similar between ISAs than different. Compare MIPS & x86:
Instructions:
Same basic types
Different names and variable-length encodings
x86 branches use condition codes (like MIPS floating point)
x86 supports (register + memory) -> (register) format
Registers:
Register-based architecture
Different number and names; x86 allows partial reads/writes
Memory:
Byte addressable, 32-bit address space
x86 has additional addressing modes, and many instruction types can access memory
RISC vs. CISC

MIPS was one of the first RISC architectures. It was started about 20 years ago by John Hennessy, one of the authors of our textbook. The architecture is similar to that of other RISC architectures, including Sun's SPARC, IBM and Motorola's PowerPC, and ARM-based processors.
Older processors used complex instruction sets, or CISC architectures.
Many powerful instructions were supported, making the assembly language programmer's job much easier. But this meant that the processor was more complex, which made the hardware designer's life harder.
Many new processors use reduced instruction sets, or RISC architectures.

Only relatively simple instructions are available. But with high-level languages and compilers, the impact on programmers is minimal. On the other hand, the hardware is much easier to design, optimize, and teach in classes.
Even most current CISC processors, such as Intel 8086-based chips, are now implemented using a lot of RISC techniques.
Tracing an Instruction's Execution

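[Figure: three-step trace of an instruction's execution (Step 1, Step 2, Step 3); Step 2 is labeled "to controller"]
[Figure: levels of abstraction, from software to hardware: Application, Operating System, Compiler, Firmware, Instr. Set Proc., I/O system, Datapath & Control, Digital Design, Circuit Design, Layout]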
Computer Organization
Computer organization is the implementation of the machine, consisting of two components:
High-level organization
Memory system widths, bus structure, CPU internal organization
Hardware
Precise logic and circuit design, mechanical packaging
Many implementations are possible for an ISA!

Intel i386, i486, Pentium, Pentium II, Core 2 Duo, Quad Core
Performance, complexity, cost differences.
Invisible to the programmer (mostly)!
Moore's Law: 2x transistors every 18 months
Technology => dramatic change

Processor
Logic capacity: about 30% increase per year
Clock rate: about 20% increase per year
Higher logic density gave room for instruction pipeline & cache
Memory
DRAM capacity: about 60% increase per year (4x every 3 years)
Memory speed: about 10% increase per year
Cost per bit: about 25% improvement per year
Performance optimization no longer implies smaller programs
Disk
Capacity: about 60% increase per year
Computers became lighter and more power efficient
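The yearly rates and multi-year factors quoted above are tied together by compounding. The following is a minimal sketch, assuming steady exponential growth; the function names are illustrative and not from the slides.

```python
def factor_over(years: float, annual_rate: float) -> float:
    """Total growth factor after `years` at a fixed yearly rate (0.60 means 60%)."""
    return (1.0 + annual_rate) ** years


def annual_rate_for_doubling(years_to_double: float) -> float:
    """Yearly rate implied by doubling once every `years_to_double` years."""
    return 2.0 ** (1.0 / years_to_double) - 1.0


if __name__ == "__main__":
    # 60% per year compounds to roughly 4x in 3 years (the DRAM capacity figure).
    print(f"60%/yr over 3 years: {factor_over(3, 0.60):.2f}x")
    # Doubling every 18 months corresponds to a little under 60% per year.
    print(f"2x per 1.5 years: {annual_rate_for_doubling(1.5):.1%} per year")
```

Under this assumption, "60% per year" and "4x every 3 years" are two ways of stating nearly the same rate.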
Pentium Die Photo

3,100,000 transistors
296 mm²
60 MHz
Introduced in 1993
1st superscalar implementation of IA-32
Four Organization Concepts/Trends

Pipelining
Frequency
Cache memory
To keep the processor running, memory latency must be overcome
Superscalar execution
Out of order execution
Multi-core
Pipelined Datapath
[Figure: pipelined datapath. The instruction memory, register array, ALU, and data memory are separated by the pipe registers IF/ID, ID/EX, EX/MEM, and MEM/WB]
Pipe Registers
Inserted between stages
Labeled by preceding & following stage
Pipelining: Use Multiple Resources
[Figure: pipeline diagram of instruction order against time (clock cycles). Inst 0 through Inst 4 each pass through Im, Reg, ALU, Dm, and Reg in successive cycles]
Conventional Pipelined Execution Representation
[Figure: program flow against time. Successive instructions each pass through IFetch, Dcd, Exec, Mem, and WB]
Improve performance by increasing instruction throughput
Ideal speedup? N stages and I instructions
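One way to answer the ideal-speedup question is to count cycles. This sketch assumes every stage takes exactly one cycle, the pipeline never stalls, and the unpipelined machine spends N cycles on each instruction; the function names are illustrative.

```python
def unpipelined_cycles(stages: int, instructions: int) -> int:
    """Each instruction occupies the whole datapath for `stages` cycles."""
    return stages * instructions


def pipelined_cycles(stages: int, instructions: int) -> int:
    """The first instruction needs `stages` cycles; each later one finishes a cycle after."""
    if instructions == 0:
        return 0
    return stages + instructions - 1


def ideal_speedup(stages: int, instructions: int) -> float:
    """Ratio of unpipelined to pipelined cycle counts under the no-stall assumption."""
    return unpipelined_cycles(stages, instructions) / pipelined_cycles(stages, instructions)


if __name__ == "__main__":
    for count in (1, 5, 100, 1_000_000):
        print(count, round(ideal_speedup(5, count), 3))
```

With N stages and I instructions the ratio is N*I / (N + I - 1), which approaches N as I grows, so the ideal speedup is the number of stages.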
Technology Trends and Performance
[Figure: two log-scale plots of Logic and DRAM over 1980 to 2007, one of Speed and one of Capacity. Curve labels: 2x in 3 years, 4x in 3 years, 2x in 3 years, 1.1x in 3 years]
CPU speed and Memory speed grow apart
Computing capacity: 4x per 3 years
If we could keep all the transistors busy all the time
Actual: 3.3x per 3 years
Moore's Law: Performance is doubled every ~18 months
Trend is slowing: process scaling declines, power is up
Processor-DRAM Memory Gap (latency)
[Figure: log-scale plot of performance against time, 1980 to 2000, with a CPU curve and a DRAM curve]
CPU ("Moore's Law"): Proc 60%/yr. (2X/1.5 yr)
DRAM: 9%/yr. (2X/10 yrs)
Processor-Memory Performance Gap: grows 50% / year
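The gap figure follows from dividing one growth rate by the other. Here is a rough sketch using the 60% and 9% yearly rates quoted on the slide; it assumes both rates stay constant, and the function names are illustrative.

```python
def annual_gap_growth(cpu_rate: float, dram_rate: float) -> float:
    """Yearly factor by which processor performance pulls ahead of DRAM performance."""
    return (1.0 + cpu_rate) / (1.0 + dram_rate)


def gap_after(years: float, cpu_rate: float = 0.60, dram_rate: float = 0.09) -> float:
    """Relative processor/DRAM gap after `years`, starting from a gap of 1.0."""
    return annual_gap_growth(cpu_rate, dram_rate) ** years


if __name__ == "__main__":
    yearly = annual_gap_growth(0.60, 0.09)
    # 1.60 / 1.09 is about 1.47, in line with the "about 50% per year" label.
    print(f"gap grows {yearly - 1.0:.0%} per year")
    print(f"gap after 10 years: {gap_after(10):.0f}x")
```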
Cache Memory
Small, fast storage used to improve average access time to slow memory.
Exploits spatial and temporal locality
In computer architecture, almost everything is a cache!
Registers: a cache on variables
First-level cache: a cache on second-level cache
Second-level cache: a cache on memory
Memory: a cache on disk (virtual memory)
[Figure: memory hierarchy of Proc/Regs, L1-Cache, L2-Cache, DRAM Memory, Harddrive; faster toward the processor, bigger toward the hard drive]
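"Average access time" can be made concrete with the standard average memory access time relation: hit time plus miss rate times miss penalty. The sketch below applies it to two cache levels. All latencies and miss rates are hypothetical numbers chosen for illustration; none come from the slides.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time = hit time + miss rate * miss penalty (in cycles)."""
    return hit_time + miss_rate * miss_penalty


if __name__ == "__main__":
    # Hypothetical figures: a 100-cycle DRAM access, a 10-cycle L2 that misses
    # 20% of the time, and a 1-cycle L1 that misses 5% of the time.
    l2_time = amat(hit_time=10, miss_rate=0.20, miss_penalty=100)
    l1_time = amat(hit_time=1, miss_rate=0.05, miss_penalty=l2_time)
    print(f"average access time seen by the processor: {l1_time:.2f} cycles")
```

The miss penalty of each level is the average access time of the level below it, which is why each level can be seen as a cache on the next.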
Modern Processor
This is an AMD Opteron CPU
Total Area: 193 mm²
Look at the relative sizes of each block:
50% cache
23% I/O
20% CPU logic + extra stuff
[Figure: die photo with labeled blocks: Floating-Point Unit, Load/Store, Execution Unit, 64 kB Data Cache, Bus Unit, Fetch/Scan/Align/Micro-code, 1 MB Unified Instruction/Data Level 2 Cache, 64 kB Instr. Cache, Memory Controller, HyperTransport, DDR Memory Interface]
CPUs dedicate > 50% area for cache memory.
CPI: Cycles Per Instruction

CPUs work according to a clock signal
Clock cycle is measured in nsec (10^-9 of a second)
Clock frequency (= 1/clock cycle) is measured in GHz (10^9 cycles/sec)
Instruction Count (IC)

Total number of instructions executed in the program
CPI = (#cycles required to execute the program) / IC

Average #cycles per instruction (in a given program)
IPC (= 1/CPI): instructions per cycle
Relating time to system measures

Iron Law: Performance = 1/execution time
Suppose that for some program we have:
T seconds running time (the ultimate performance measure)
C clock ticks, I instructions, P seconds/tick (performance measures of interest to the system designer)
T secs = C ticks x P secs/tick
       = (I inst/I inst) x C ticks x P secs/tick
T secs = I inst x (C ticks/I inst) x P secs/tick
running time = instruction count x avg clock ticks per instruction (CPI) x clock period
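The relation between running time, instruction count, CPI, and clock period can be checked with a few lines of arithmetic. This is a small sketch; the program size, cycle count, and 2 GHz clock are made-up values for illustration, and the function names are not from the slides.

```python
def cycles_per_instruction(cycles: int, instruction_count: int) -> float:
    """CPI: total clock ticks divided by total instructions executed."""
    return cycles / instruction_count


def execution_time(instruction_count: int, cpi: float, clock_hz: float) -> float:
    """T = I x CPI x P, where the clock period P is 1 / clock frequency."""
    clock_period = 1.0 / clock_hz
    return instruction_count * cpi * clock_period


if __name__ == "__main__":
    # Hypothetical program: 1e9 instructions taking 1.5e9 cycles on a 2 GHz clock.
    cpi = cycles_per_instruction(1_500_000_000, 1_000_000_000)
    print(f"CPI = {cpi}, IPC = {1.0 / cpi:.3f}")
    print(f"T = {execution_time(1_000_000_000, cpi, 2e9):.2f} s")
```

Halving any one of the three factors halves the running time, provided the change does not make the other two worse.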
Pentium 4 Block Diagram
[Figure: Pentium 4 block diagram. Source: Microprocessor Report]
Out Of Order Execution

Look ahead in a window of instructions and find instructions that are ready to execute
Don't depend on data from previous instructions that have not yet executed
Resources are available
Out-of-order execution
Start executing an instruction before a previous instruction has executed
Advantages:
Help exploit Instruction Level Parallelism (ILP)
Help cover latencies (e.g., L1 data cache miss, divide)
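Data Flow Analysis

Example:
(1) r1 <- r4 / r7 ; assume divide takes 20 cycles
(2) r8 <- r1 + r2
(3) r5 <- r5 + 1
(4) r6 <- r6 - r3
(5) r4 <- r5 + r6
(6) r7 <- r8 * r4
In-order execution: 1, 2, 3, 4, 5, 6, so everything after the divide waits for it
Out-of-order execution: 3, 4, and 5 execute while the divide is in progress; 2 and then 6 follow once r1 is ready
Data Flow Graph: 1 -> 2 (r1), 2 -> 6 (r8), 3 -> 5 (r5), 4 -> 5 (r6), 5 -> 6 (r4)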
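The data flow graph for this example can be derived mechanically by tracking the last writer of each register. The sketch below does that and then computes the earliest cycle at which each instruction could start if only true dependencies mattered. The 20-cycle divide is from the example; the 1-cycle latency for every other operation, the unlimited execution resources, and the operator of instruction 4 (which does not affect the dependencies) are assumptions.

```python
# Each entry: (number, destination, source registers, latency in cycles).
PROGRAM = [
    (1, "r1", ("r4", "r7"), 20),  # r1 <- r4 / r7
    (2, "r8", ("r1", "r2"), 1),   # r8 <- r1 + r2
    (3, "r5", ("r5",), 1),        # r5 <- r5 + 1
    (4, "r6", ("r6", "r3"), 1),   # r6 <- r6 - r3
    (5, "r4", ("r5", "r6"), 1),   # r4 <- r5 + r6
    (6, "r7", ("r8", "r4"), 1),   # r7 <- r8 * r4
]


def flow_edges(program):
    """Return (producer, consumer) pairs for true (read-after-write) dependencies."""
    last_writer = {}  # register -> number of the latest instruction that wrote it
    edges = []
    for num, dest, srcs, _latency in program:
        for reg in srcs:
            if reg in last_writer:
                edges.append((last_writer[reg], num))
        last_writer[dest] = num
    return edges


def dataflow_start_cycles(program):
    """Earliest start cycle per instruction when only true dependencies constrain it."""
    latency = {num: lat for num, _dest, _srcs, lat in program}
    start = {num: 0 for num, _dest, _srcs, _lat in program}
    # Edges are grouped by consumer in program order, so a producer's start
    # cycle is final before any of its consumers is processed.
    for producer, consumer in flow_edges(program):
        start[consumer] = max(start[consumer], start[producer] + latency[producer])
    return start


if __name__ == "__main__":
    print(flow_edges(PROGRAM))
    print(dataflow_start_cycles(PROGRAM))
```

Instructions 3, 4, and 5 get start cycles well before the divide completes, which is the slack that out-of-order execution exploits. The writes to r4 and r7 by instructions 5 and 6 are ignored here because they are not true dependencies.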
OOOE General Scheme

[Figure: Fetch & Decode (in-order) -> Instruction pool -> Execute (out-of-order) -> Retire/commit (in-order)]
Fetch & decode instructions in parallel but in order, to fill the instruction pool
Execute ready instructions from the instruction pool
All the data required for the instruction is ready
Execution resources are available
Once an instruction is executed
Signal all dependent instructions that data is ready
Commit instructions in parallel but in order
Can commit an instruction only after all preceding instructions (in program order) have committed
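The in-order commit rule amounts to a running maximum over completion times. This is a simplified sketch that assumes any number of instructions may commit in the same cycle and that an instruction may commit in the cycle it finishes; the completion times in the example are hypothetical.

```python
def commit_cycles(completion_cycles):
    """Commit cycle per instruction, given completion cycles in program order.

    An instruction commits once it has finished executing and every
    preceding instruction has committed.
    """
    commits = []
    latest = 0
    for done in completion_cycles:
        latest = max(latest, done)
        commits.append(latest)
    return commits


if __name__ == "__main__":
    # Hypothetical: the first instruction is a long divide, the next depends on it,
    # and the following three finish early but must wait their turn to commit.
    print(commit_cycles([20, 21, 1, 1, 2, 22]))
```

Instructions that finish early still hold their results back until older instructions have committed, which keeps the architectural state consistent with program order.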
Out Of Order Execution Example

Assume that executing a divide operation takes 20 cycles
(1) r1 <- r5 / r4
(2) r3 <- r1 + r8
(3) r8 <- r5 + 1
(4) r3 <- r7 - 2
(5) r6 <- r6 + r7
Inst2 has a REAL (flow) dependency on r1 with Inst1

It cannot be executed in parallel with Inst1
Can successive instructions pass Inst2?

Inst3 cannot, since Inst2 must read r8 before Inst3 writes to it
Inst4 cannot, since it must write to r3 after Inst2
Inst5 can
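The reasoning about which instructions may pass Inst2 can be expressed as a pairwise check on destination and source registers. The sketch below uses an illustrative encoding of the five instructions as (destination, sources) pairs, with immediates omitted; the labels RAW, WAR, and WAW stand for read-after-write, write-after-read, and write-after-write.

```python
def hazards(earlier, later):
    """Return the set of dependencies that `later` has on `earlier`.

    Both arguments are (destination register, tuple of source registers).
    """
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    found = set()
    if e_dest in l_srcs:
        found.add("RAW")  # later reads a value that earlier produces (real dependency)
    if l_dest in e_srcs:
        found.add("WAR")  # later overwrites a register that earlier still has to read
    if l_dest == e_dest:
        found.add("WAW")  # both write the same register; the final value must be later's
    return found


# Hypothetical encoding of the example; immediate operands are left out.
INST = {
    1: ("r1", ("r5", "r4")),
    2: ("r3", ("r1", "r8")),
    3: ("r8", ("r5",)),
    4: ("r3", ("r7",)),
    5: ("r6", ("r6", "r7")),
}

if __name__ == "__main__":
    for later in (3, 4, 5):
        print(f"Inst{later} vs Inst2:", hazards(INST[2], INST[later]) or "no dependency")
```

Only Inst5 comes back with an empty set against Inst2, matching the conclusion that it alone can pass.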
Overcoming False Dependencies

OOOE creates new dependencies
WAR (write after read): write to a register which is read by an earlier inst.
(1) r3 <- r2 + r1
(2) r2 <- r4 + 3
WAW (write after write): write to a register which is written by an earlier inst.
(1) r3 <- r1 + r2
(2) r3 <- r4 + 3
These are false dependencies

There is no missing data
Still prevent executing instructions out-of-order
Solution: Register Renaming
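A toy version of register renaming shows why the false dependencies disappear. This sketch assumes an unlimited supply of physical registers (named p0, p1, ... here for illustration) and ignores freeing them; real hardware uses a finite register file and a mapping table with reclamation.

```python
from itertools import count


def rename(program):
    """Rename registers so that every write targets a fresh physical register.

    `program` is a list of (destination, sources); sources may be register
    names (strings) or immediate values (integers).
    """
    fresh = (f"p{i}" for i in count())
    mapping = {}  # architectural register -> physical register holding its current value
    renamed = []
    for dest, srcs in program:
        new_srcs = []
        for src in srcs:
            if isinstance(src, str):
                if src not in mapping:  # value that existed before this code ran
                    mapping[src] = next(fresh)
                new_srcs.append(mapping[src])
            else:
                new_srcs.append(src)  # immediates are left unchanged
        # Sources are looked up before the destination is remapped.
        mapping[dest] = next(fresh)
        renamed.append((mapping[dest], tuple(new_srcs)))
    return renamed


if __name__ == "__main__":
    war = [("r3", ("r2", "r1")), ("r2", ("r4", 3))]
    waw = [("r3", ("r1", "r2")), ("r3", ("r4", 3))]
    print(rename(war))
    print(rename(waw))
```

After renaming, the second instruction in each pair writes a register that the first one neither reads nor writes, so the two can execute in either order. True dependencies survive because a reader is mapped to the physical register of its producer.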
Pentium-4 Die Photo

EBL/BBL - Bus logic, Front, Back
MOB - Memory Order Buffer
Packed FPU - MMX Fl. Pt. (SSE)
IEU - Integer Execution Unit
FAU - Fl. Pt. Arithmetic Unit
MIU - Memory Interface Unit
DCU - Data Cache Unit
PMH - Page Miss Handler
DTLB - Data TLB
BAC - Branch Address Calculator
RAT - Register Alias Table
SIMD - Packed Fl. Pt.
RS - Reservation Station
BTB - Branch Target Buffer
IFU - Instruction Fetch Unit (+I$)
ID - Instruction Decode
ROB - Reorder Buffer
MS - Micro-instruction Sequencer
1st Pentium4: 9.5 M transistors
An Exciting Time for CE - CMPs

[Figure: a monolithic uniprocessor next to a chip multi-processor (CMP); labels: processor cores, L2$, shared L2$, L3$, shared L3$]
Computer System Structure

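[Figure: computer system structure. The CPU (with cache) connects over the CPU bus to the North Bridge, which holds the memory controller and on-board graphics and connects to an external graphics card (PCI express x16) and, over the memory bus, to two channels of DDRII memory. The South Bridge (I/O controller) provides USB, IDE, and SATA controllers, PCI express x1, PCI, a sound card with speakers, a LAN adapter, and parallel and serial ports, and attaches the floppy drive, keyboard, mouse, DVD drive, hard disk, and LAN]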
