Central Processing Unit Architecture: Architecture Overview, Machine Organization, Speeding Up CPU Operations
Computer Architecture
Major components of a computer
- Central Processing Unit (CPU)
- memory
- peripheral devices
- bit
- byte = 8 bits (smallest addressable location)
- word = 4 bytes (typically; machine dependent)
The ALU performs transfers between memory and I/O devices; note that each memory word holds two instructions.
(figure: Arithmetic Logic Unit, main memory, Input/Output equipment)
Program control determines what the computer does based on the instruction read from memory:
- MAR = memory address register; holds the address of the memory cell to be read
- PC = program counter; holds the address of the next instruction to be read
- IR = instruction register; holds the instruction being executed
- IBR = instruction buffer register; holds the right half of the instruction word read from memory
Fetch
- PC → MAR; read M(MAR) into MBR
- copy the left and right instructions into IR and IBR
Execute
- address part of IR → MAR; read M(MAR) into MBR
- execute the opcode
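The fetch/execute steps above can be sketched as a small simulator. This is a minimal illustration, not the real IAS encoding: each memory word holds a (left, right) pair of instructions, and the opcode names (`LOAD`, `ADD`, `STORE`, `HALT`) are invented for the example.

```python
# Minimal sketch of the fetch/execute cycle described above, with
# two instructions packed per memory word. Opcodes are illustrative.
def run(memory, data):
    pc = 0          # program counter: address of next instruction word
    acc = 0         # accumulator
    while pc < len(memory):
        mar = pc                  # PC -> MAR
        mbr = memory[mar]         # read M(MAR) into MBR
        ir, ibr = mbr             # left half -> IR, right half -> IBR
        pc += 1
        for opcode, addr in (ir, ibr):   # execute both halves in order
            if opcode == "LOAD":
                acc = data[addr]         # address part -> MAR; fetch operand
            elif opcode == "ADD":
                acc += data[addr]
            elif opcode == "STORE":
                data[addr] = acc
            elif opcode == "HALT":
                return acc
    return acc

program = [(("LOAD", 0), ("ADD", 1)),
           (("STORE", 2), ("HALT", 0))]
print(run(program, {0: 2, 1: 3, 2: 0}))   # 2 + 3 -> 5
```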
Architecture Families
Before the mid-1960s, every new machine had a different instruction set architecture:
- programs from the previous generation didn't run on the new machine
- the cost of replacing software became too large
A single instruction set architecture offers a wide range of price and performance with the same software:
- memory path width (1 byte to 8 bytes)
- faster, more complex CPU designs
- greater I/O throughput and overlap
Depending on the mix of instructions and operand use, having many registers may lead to less memory traffic and faster execution. Most modern machines use a multiple-register architecture:
- maximum number about 512; a common configuration is 32 integer and 32 floating-point registers
Pipelining
One way to speed up the CPU is to increase the clock rate, but there are limits on how fast a clock can run and still complete an instruction.
Pipelining (continued)
Consider an example with 6 stages:
- FI = fetch instruction
- DI = decode instruction
- CO = calculate location of operand
- FO = fetch operand
- EI = execute instruction
- WO = write operand (store result)
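The payoff of the six stages above can be shown with a little arithmetic: in an ideal k-stage pipeline, one instruction completes per cycle once the pipe is full, so n instructions take k + (n - 1) cycles instead of k * n.

```python
# Ideal timing of the 6-stage pipeline described above.
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def pipelined_cycles(n, k=len(STAGES)):
    # first instruction takes k cycles; each later one finishes 1 cycle after
    return k + (n - 1)

def unpipelined_cycles(n, k=len(STAGES)):
    # without overlap, every instruction takes all k stages serially
    return k * n

print(pipelined_cycles(9), unpipelined_cycles(9))   # 14 vs. 54 cycles
```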
Pipelining Example
Pipelining (continued)
Hazards to pipelining:
- conditional jumps: e.g., instruction 3 branches to instruction 15, so the pipeline must be flushed and restarted
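The flush cost can be estimated under one assumption (mine, for illustration): the branch outcome is not known until the EI stage, so every instruction fetched behind the branch up to that point is wrong-path work that must be discarded.

```python
# Estimated flush penalty for a taken branch in the 6-stage pipeline,
# assuming the branch is resolved in its EI stage (an illustrative
# assumption; real machines resolve branches at different points).
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def branch_penalty(resolve_stage="EI"):
    # one wrong-path instruction was fetched in each cycle before resolution
    return STAGES.index(resolve_stage)

print(branch_penalty())   # 4 bubble cycles per taken branch
```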
Real-life Problem
Not all instructions execute in one clock cycle
Floating point takes longer than integer; fp divide takes longer than fp multiply, which takes longer than fp add. Typical values (in clock cycles):

  integer add/subtract       1
  memory reference           1
  fp add                     2   (make 2 stages)
  fp (or integer) multiply   6   (make 2 stages)
  fp (or integer) divide    15
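The latencies above can be put in a table and used to total up an instruction mix. The cycle counts are the slide's typical values; real machines vary.

```python
# Execute-stage latencies from the table above, in clock cycles.
LATENCY = {
    "int_add": 1,
    "mem_ref": 1,
    "fp_add": 2,
    "fp_mul": 6,
    "fp_div": 15,
}

def execute_cycles(instruction_mix):
    # total execute cycles if the instructions ran back-to-back, unoverlapped
    return sum(LATENCY[op] for op in instruction_mix)

print(execute_cycles(["mem_ref", "fp_mul", "fp_add", "int_add"]))  # 1+6+2+1 = 10
```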
Pipelining (continued)
This is not simple to implement; note that all 6 instructions could finish at the same time!
More Speedup
Pipelined machines issue one instruction each clock cycle. How can the CPU be sped up even more?
Superscalar Architectures
Superscalar machines issue a variable number of instructions each clock cycle, up to some maximum:
- instructions must satisfy some criteria of independence
- a simple choice is a maximum of one fp and one integer instruction per clock
- separate execution paths are needed for each possible simultaneous instruction issue
- compiled code from a non-superscalar implementation of the same architecture runs unchanged, but slower
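The simple dual-issue rule above (at most one integer and one fp instruction per clock) can be sketched as a pairing loop. The instruction tuples are just labels for this illustration; real issue logic also checks operand dependences between the paired instructions.

```python
# Sketch of dual issue: pair adjacent instructions into one cycle only
# when they use different units (one "int", one "fp").
def dual_issue(instructions):
    cycles = []
    i = 0
    while i < len(instructions):
        issued = [instructions[i]]
        # pair the next instruction only if it needs the other unit
        if i + 1 < len(instructions) and instructions[i + 1][0] != instructions[i][0]:
            issued.append(instructions[i + 1])
            i += 1
        i += 1
        cycles.append(issued)
    return cycles

prog = [("int", "add"), ("fp", "mul"), ("fp", "add"), ("int", "sub")]
print(len(dual_issue(prog)))   # 2 cycles instead of 4
```

Two same-unit instructions in a row force a one-wide cycle, which is exactly the superscalar problem discussed below.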
Superscalar Example
(figure: superscalar issue timing diagram over clock cycles 0-8)
Superscalar Problem
Instruction-level parallelism: what if two successive instructions can't be executed in parallel?
VLIW Architectures
Very Long Instruction Word (VLIW) architectures store several simple instructions in one long instruction word fetched from memory:
- the number and type of instruction slots are fixed
- e.g., 2 memory reference, 2 floating point, 1 integer
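The fixed slot format above can be sketched as a packing routine. The slot names and the greedy filling strategy are illustrative; a real VLIW compiler must also respect data dependences when scheduling operations into a word.

```python
# Sketch of packing operations into the fixed VLIW format above:
# each long instruction word has 2 memory-ref slots, 2 fp slots, 1 integer slot.
SLOTS = {"mem": 2, "fp": 2, "int": 1}

def pack(ops):
    """Greedily fill instruction words; each op is a (unit, text) pair."""
    bundles, current = [], {"mem": [], "fp": [], "int": []}
    for unit, text in ops:
        if len(current[unit]) == SLOTS[unit]:    # slot type full: start new word
            bundles.append(current)
            current = {"mem": [], "fp": [], "int": []}
        current[unit].append(text)
    bundles.append(current)
    return bundles

ops = [("mem", "LD F0,0(R1)"), ("mem", "LD F6,8(R1)"),
       ("mem", "LD F10,16(R1)"), ("fp", "AD F4,F0,F2")]
print(len(pack(ops)))   # 2 long instruction words
```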
VLIW Example
  Memory Ref 1    Memory Ref 2    FP 1           FP 2           Integer
  LD F0,0(R1)     LD F6,8(R1)
  LD F10,16(R1)   LD F14,24(R1)
  LD F18,32(R1)   LD F22,40(R1)   AD F4,F0,F2    AD F8,F6,F2
  LD F26,48(R1)                   AD F12,F10,F2  AD F16,F14,F2
                                                                SB R1,R1,#48
Success of superscalar and VLIW machines depends on the number of instructions that occur together and can be issued in parallel:
- no dependencies
- no branches
Compilers can help create parallelism. Speculation techniques try to overcome the branch problem:
- assume the branch is taken; execute the instructions, but don't let them store results until the status of the branch is known
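The speculation idea above can be sketched with a result buffer: work on the predicted path proceeds, but its results are only committed once the branch outcome is known, and are discarded on a misprediction. The function and its arguments are invented for this illustration.

```python
# Sketch of speculative execution: hold results from the predicted path
# in a buffer; commit only when the branch outcome matches the prediction.
def speculate(predicted_path_results, branch_taken, predicted_taken=True):
    buffer = list(predicted_path_results)   # results held, not yet stored
    if branch_taken == predicted_taken:
        return buffer                       # prediction correct: commit
    return []                               # misprediction: flush the buffer

print(speculate([("F4", 5.0)], branch_taken=True))    # committed
print(speculate([("F4", 5.0)], branch_taken=False))   # [] (flushed)
```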
The IBM 70X and 70X0 series went from 24 opcodes to 185 in 10 years; over the same period, performance increased 30 times.
Motivations were to:
- improve efficiency, since complex instructions can be implemented in hardware and execute faster
- make life easier for compiler writers
- support more complex higher-level languages
Examination of actual code indicated many of these features were not used. RISC advocates proposed:
- a simple, limited instruction set
- a large number of general-purpose registers
- mostly register-to-register operations